Skip to main content

Regular Expression Study Notes

Free2015-03-07#Regular_Expression#Topic#正则表达式#正则环视#正则引擎#正则优化

After spending a month reading 500 pages of theoretical knowledge, I am sharing my study notes here. Regular expressions are not just a simple table of metacharacters, and the related knowledge cannot be encapsulated in a single thin article. I hope this article can help you deepen your understanding of Regex while broadening your horizons; on the other hand, it serves as a note-to-self for future reference. Specific content includes [How Regular Expressions Work], [Regex Lookaround], [Regex Engines], [Regex Optimization], etc.

Foreword: (A bit of a tangent, Click me to skip>>)

As mentioned in the summary, regular expressions are a vast knowledge system, not just a simple table of metacharacters, nor something that can be explained in a few words.

Someone once commented, "...if in the history of computer development to date, some great things have appeared, Regular Expressions (Regex) would be one of them, alongside the Web, Lisp, Hash algorithms, UNIX, Relational Models, and Object-Oriented programming; but such things definitely do not exceed 20 items..."

Saying this might still not be enough to catch your attention, because although you've heard of Regex and can understand existing expressions by looking at a metacharacter table, you rarely use Regex in actual development...

Indeed, so is Regex still alive? Where has it gone?

The answer is that Regex has permeated our programming languages, operating systems, and related applications. For example, many high-level languages provide methods similar to String.find(), and many operating systems provide file content search commands (like Linux's grep command), all of which are related to regular expressions.

So, since Regex has "disappeared" (permeated), is it still necessary to learn it? Of course. Regular Expression is a technology, and understanding the significance of a technology is far greater than mastering a tool.

Table of Contents

1.How Regular Expressions Work

2.Regex Engines

3.Regex Lookaround

4.Backtracking

5.Optimization of Regular Expressions

6.How to Write Efficient Regular Expressions?

7.A Few Common Mistakes

8.Summary

9.Appendix [Metacharacter Table] [Mode Control Character Table] [Special Metacharacter Table]

I. How Regular Expressions Work

The specific process of applying a regular expression to a target string is as follows:

  1. Regular Expression Compilation

    Check the syntax correctness of the regular expression; if correct, compile it into an internal form.

  2. Transmission Start

    The transmission device "positions" the regex engine at the starting position of the target string.

    P.S. To briefly explain "transmission," it is a mechanism within the regex engine. For example, applying [abc] to the string "family": first, try "f" at the first position, fail; then move to "a" at the second position, success, and matching ends. Note: who is controlling this "bit-by-bit" processing (first position, then second position after failure...)? Correct, it's the so-called transmission device.

  3. Element Detection

    The regex engine begins attempting to match the regular expression and the text. This involves not only moving forward bit-by-bit but also a backtracking process (backtracking is a key point and will be explained in detail later).

  4. Obtain Matching Result

    Determine the matching result, success or failure. The specific process depends on the type of regex engine; for example, returning success upon finding the first complete match, or continuing to search for the longest qualified string after finding the first one.

  5. Driving Process

    If no suitable match is found at the current position, the transmission device drives the engine to start a new round of attempts from the next character after the current position.

  6. Total Match Failure

    If the transmission device drives the engine to the end of the specified string and still finds no suitable match, the match is declared a failure. (Simply put, it's a failure if nothing matches from start to finish. The description here is somewhat technical to stay closer to the internal principles.)

II. Regex Engines

The so-called regex engine types are actually a classification. As mentioned earlier, Regex is a technology. Anyone can use it to solve problems, but everyone's approach is different. In other words, the specific implementations and rules of regular expressions vary. Through long-term development, several schools of thought have eventually formed, each promoting different rules.

Common schools (regex engine types) include the following:

  1. NFA (Non-deterministic Finite Automaton, don't worry about this strange name...)
  2. DFA
  3. POSIX NFA
  4. Hybrid DFA/NFA

We don't need to know the classification criteria for each engine; we just need to understand the differences between them and which classification the tools we use belong to. It's very simple:

  1. NFA

    Tools of this type: Java, GNU Emacs, grep, .NET, PHP, Python, Ruby, etc.

    Difference: NFA can be called a regex-directed engine, because its matching efficiency is closely related to the regular expression (e.g., the order of alternation branches in the expression).

  2. DFA

    Tools of this type: awk, egrep, flex, lex, MySQL, Procmail, etc.

    Difference: DFA is called a text-directed engine. Its matching efficiency is only related to the text (target string). Equivalent expressions in different forms have the same efficiency, such as [a-d] and [abcd]. Note that in NFA, the efficiency of these two is different; generally, the former is better.

  3. POSIX NFA

    Tools of this type: mawk, Mortice Kern System's utilities, etc.

    Difference: Whether the match is successful or not, it must try all possibilities to find the longest possible match.

  4. Hybrid DFA/NFA

    Tools of this type: GNU awk, GNU grep/egrep, Tcl

    Difference: This type of engine is arguably the best and most mature. Internal engine optimizations are relatively complete, combining the advantages of both DFA and NFA. However, few tools currently use this type of engine.

After saying all this, what we really need to know is:

Before using a tool that supports Regex, it is extremely important to know which type its engine belongs to. This is because different engines have different working mechanisms. For example, PHP's three regex libraries are all NFA-type, where matching is closely related to the expression, so I should logically optimize the expressions to improve efficiency.

III. Regex Lookaround

[Actually, there's no need to list this separately as it's just a small part of regular expressions. However, since some people don't know "lookaround" and others have heard of it but don't understand it and think it's very profound... I'll discuss it separately (it's definitely not difficult).]

1. What is "Lookaround"?

Literally, "lookaround" means looking in all directions. Regex lookaround follows the same principle: drive to a position and first look left and right to see if this position is what we are looking for.

For example, using (this|that) to match "there is a boy lying under that tree." Obviously, this expression is very inefficient under an NFA engine. It works like this:

  1. First, encountering the first "t", check "this" bit-by-bit; find "i" doesn't match "e", then check "that" bit-by-bit; find "a" doesn't match "e";
  2. Drive forward one position to "h", check "this" bit-by-bit... check "that" bit-by-bit...;
  3. Drive forward one position to "e", .......
  4. ...

A lot of useless work was done. So how to optimize it?

You can extract the prefix (one of the common optimization methods, summarized later) and turn it into th(is|at).

Of course, since we're discussing lookaround here, let's solve it using lookaround: (?=th)(this|that). Oh, what if you don't understand the preceding (?=)?

No problem, this is a positive lookahead. It means: I move forward from the start, and when I encounter "th", I stop and compare the part of the expression after (?=th)—(this|that). [Note: conversely, if "th" is not encountered, it doesn't stop and continues moving forward... does the efficiency change a bit?]

After optimization, the number of comparisons is significantly reduced. Of course, using lookaround here might seem a bit overkill; we're just giving a simple example of applying lookaround, so don't take it too seriously.

2. Types and Roles of Regex Lookaround

TypeRegular ExpressionCondition for Successful Match
Positive Lookahead(?=...)Subexpression can match text to the right
Positive Lookbehind(?<=...)___________________left______
Negative Lookahead(?!...)Subexpression cannot match text to the right
Negative Lookbehind(?<!...)___________________left______

P.S. The "right" and "left" above refer to the sides of the current position where matching occurs, which is different from general matching. For example:

Matching the string "family" with the positive lookahead (?=a)abc: the initial position is before the "f", not at the position of "f". Why is that?

Because [ lookaround structures do not match any characters; they only match specific positions in the text. ]. If the current position is between "f" and "a", the positive lookahead succeeds and bit-by-bit detection of "abc" begins.

We find that positive lookahead can restrict the position where comparison actually starts, thereby reducing the number of attempts.

3. Applications of Lookaround

Lookaround is mostly used for expression optimization and other special occasions (where it's indispensable; though generally, lookaround can be replaced by other more complex structures).

For example, to match the word "the" in "the land belongs to these animals", how to avoid matching the "the" in "these"?

We can easily think of word boundaries (if the engine supports them). Just use \bthe\b for global matching.

Actually, for this example, we can also use the(?!\w) to achieve the goal. Even if the preceding "the" matches the "the" in "these", the following negative lookahead (?!\w) will exclude "these" (here the negative lookahead restricts that "e" cannot be followed by a word character; specifically, \w is equivalent to [a-zA-Z0-9], which might not be perfectly suitable here but serves to illustrate the point).

IV. Backtracking (A key issue before mentioning optimization)

Simply put, backtracking is retreating to an untried branch (or returning to a saved state. The first explanation is easier for beginners, while the second is more accurate).

For example, using .*! to match the string: "An idle youth, a needy age!", an old saying said.

  1. First, * modifying . can match any number of characters (the dot represents any character, * represents any quantity). Moreover, * is greedy (it matches as long a string as possible).
  2. So .* matches the entire string (from "A" to "."). At this point, detection finds that "!" cannot match. What to do?
  3. The string matched by .* must give up a portion to allow "!" a chance to match. It gives up the dot at the end, but "!" still doesn't match.
  4. Continue giving up characters; next is "d", still no match.
  5. ...
  6. When the "!" after "age" is given up, the match succeeds.

In this entire process, from the time .* consumes the entire string until it is forced to give back characters to match "!", the action being performed is backtracking (simply put, the engine's drive is moving backward).

Backtracking like this is obviously meaningless and time-consuming. A large part of the optimization we do is to reduce the number of backtracks.

From another perspective, reducing backtracking improves matching efficiency, or shortens the time from the start of the engine's work to the feedback of the matching result (success/failure). Isn't that exactly what optimization is?

V. Optimization of Regular Expressions

  1. Efficiency Indicators

    When evaluating the efficiency of a regular expression, there are two main indicators: the number of attempts (comparisons) and the number of backtracks.

    Assuming the correctness of the expression, the fewer the attempts and backtracks, the better. Fewer counts mean finding a suitable match faster (or failing a match faster).

  2. Optimization Operations

    Optimization operations move in two directions:

    1. Accelerating Certain Operations

      This needs to be considered in conjunction with specific internal engine implementations. For example, generally under NFA engines, [\d] is faster than [0-9], and [0-9] is faster than [0123456789].

    2. Avoiding Redundant Operations

      This means precise restriction. For example, the regex lookaround example mentioned earlier: restricting the match start position can greatly improve efficiency.

      Of course, trade-offs are needed when performing such optimizations. If a significant amount of time is spent restricting positions but the matching efficiency decreases, then such optimization is undesirable.

    Whether to optimize and to what extent requires balancing based on specific application scenarios.

  3. Common Optimization Methods

    There are many optimization methods; here we only list some of the most common ones (interested readers can refer to relevant books):

    1. Eliminate Unnecessary Parentheses

      In many cases, adding () is only to limit the scope of quantification rather than to capture matched text. In such cases, non-capturing parentheses (?:) should be used instead of capturing parentheses (). This not only reduces memory overhead but also significantly improves efficiency.

    2. Eliminate Unnecessary Character Classes

      Some people are used to using character classes like [.] to represent a single special character. This can actually be replaced with \.. Similarly, [*] -> \*, and so on.

    3. Avoid Repeated Compilation

      This refers to considerations when applying Regex in other tools. For example, using Java to apply a regular expression to text: the expression needs to be compiled first. A regular expression only needs to be compiled once, so the compilation part should not be inside a loop to avoid repeated compilation and save extra time.

    4. Use Starting Anchors

      This is a good habit to develop. For example, most regular expressions starting with .* can be preceded by ^ or \A to represent the beginning of a line or paragraph. What are the benefits of doing this?

      In some older engines, the effect of this optimization is very significant. Imagine if .* tries the target string and finds no suitable match. If there is no ^ or \A before the expression, the engine will start a new round of attempts from the second character of the target string... Obviously, this is pointless (we know the result after one round; there's no need for a 2nd or nth round).

      Some more mature engines can automatically optimize such expressions. If they detect an expression starting with .* without ^ or \A, the engine will automatically add a starting position flag to avoid pointless attempts.

      For us, adding a starting anchor before .* should become a habit.

    5. Isolate Literal Text

      For example, [xx*] is faster than [x+], x{3, 5} is not as fast as xxxx{0, 2}, and th(?:is|at) is faster than (?:this|that).

VI. How to Write Efficient Regular Expressions?

Writing regular expressions should follow these steps:

  1. Match expected text
  2. Exclude unexpected text
  3. Easy to control and understand
  4. Ensure efficiency and obtain results as quickly as possible (success/failure)

The first two points ensure correctness, while the latter two require a proper trade-off between efficiency and usability. These are the principles of writing regular expressions.

There is a classic saying that basically explains the general principle: Don't throw the baby out with the bathwater.

VII. A Few Common Mistakes

  1. Differences Between [-./], [.-/], and [./-]

    At first glance, they seem the same. In fact, the first and third are equivalent, meaning the character at the current position must be a hyphen, a dot, or a slash.

    The second expression is incorrect. It means the character at the current position must be any character between the dot and the slash (simply put, the "-" here represents a range, similar to [a-z]). However, what characters exist between the dot and the slash depends on the character set environment. In Unicode, many strange characters could appear, which contradicts our original intention.

    Therefore, when using "-" in a character class, you must carefully check its position to avoid such errors.

  2. Differences of ^ Inside and Outside []

    Outside, ^ represents the start of a line and $ represents the end. Inside, ^ represents negation ([^...] is the so-called negated character class) or a literal character ([...^]).

  3. Difference Between [ab]* and (a*|b*)

    They seem equivalent, but there is a special case: the former can match "aba" while the latter cannot. Furthermore, the former is more efficient.

  4. Common Mistakes When Using Quantifiers (?+*)

    When nested quantifiers are present, the semantics should be carefully considered to avoid causing a loop (infinite backtracking). For example, using "(\\.|[^\\"]+)*" to match a quoted section in text where the quoted part can include escaped double quotes. This expression will cause a loop and almost never yield a result.

    However, nested quantifiers do not always lead to loops. In short, one should be extremely cautious when nested quantifiers appear in an expression.

VIII. Summary

My personal view on regular expressions is:

If your understanding of Regex is not thorough, try not to use it to solve complex problems (or apply very long regular expressions). The pitfalls within can be baffling. Constructing a perfect regular expression requires very rigorous thinking. In general applications, matching strings through programmatic logic is easier to control.

Of course, I'm not saying you shouldn't use Regex at all (don't "stop eating for fear of choking"). One must admit that in certain situations, Regex has irreplaceable and magical effects (such as extracting URLs from text...).

Even if you don't use it yourself, you should fully understand regular expressions because others will, and you will eventually encounter them.

### Reference Matching

An uncommon usage that allows referencing parts already matched within the regular expression. Example:

// js code
var regex = /(\w{2,4}).+\1.+\1/i;
console.log(regex.test('qwer11qwe213234qw'));   // true
console.log(regex.test('qwer11qwe213234q'));    // false

Used to detect strings containing unknown repeating sequences (where only the number of repetitions is known, not the sequence itself). In the example above, the repeating sequence is qw.

IX. Appendix [Metacharacter Table] [Mode Control Character Table] [Special Metacharacter Table]

1. Metacharacter Table (Providing metacharacters commonly supported by most tools)

MetacharacterNameMeaning
^CaretIndicates start of line position
$Dollar signIndicates end of line position
.DotRepresents any character (generally cannot represent line-ending \n)
[]Character classRepresents any one of the characters inside (must match one character)
[^]Negated character classRepresents any character except those inside (must match one character)
\charEscape characterIndicates another meaning for "char"; e.g., \^ represents a literal "^" rather than start of line
()(Capturing) ParenthesesRepresents scope of quantifier or captures matched text (can be accessed in back-references)
(?:)Non-capturing parenthesesSame function as parentheses but does not capture text
?Question markQuantifier, indicates the preceding part is optional
*AsteriskQuantifier, indicates the preceding part can appear any number of times (or not at all)
+Plus signQuantifier, indicates the preceding part appears at least once, with no upper limit
{min, max}IntervalQuantifier, indicates the preceding part appears at least "min" times and at most "max" times
{num}Special intervalQuantifier, indicates the preceding part must appear exactly "num" times
|Vertical barRepresents OR, used for alternation structures
\<Word boundaryIndicates start of word position
\>Word boundaryIndicates end of word position
\numBack-referenceRepresents text captured by the nth capturing group (counting is based on the order of opening parentheses, including nested ones)

2. Mode Control Character Table (Providing examples of mode control characters; may vary by tool)

Control CharacterMeaning
iCase-insensitive matching
gGlobal match, finds all matching parts in target text; default finds only the first
xExtended/Freespacing, regex can span multiple lines and include comments
mMultiline mode, splits text into logical lines so ^ and $ match respective positions on each line rather than start/end of whole string
sSingle-line/Dotall mode, where the dot matches any character (default dot matches anything except newline)

3. Special Metacharacter Table (Providing special metacharacters supported by certain tools)

MetacharacterMeaning
\dDigit, equivalent to [0-9]
\DNon-digit character, equivalent to [^0-9]
\wAlphanumeric, equivalent to [a-zA-Z0-9]
\WNon-alphanumeric, equivalent to [^a-zA-Z0-9]
\sWhitespace character, e.g., space, tab, form feed, carriage return, newline, etc.
\SNon-whitespace character
\bWord boundary, indicates start or end of word position
(?>...)Atomic group, does not give up matched characters; e.g., (?>\w+!) cannot match "Hi!"
??, +?, *?, {min, max}?Lazy/Reluctant quantifiers, match as little as possible (shortest match while still allowing a full match)
?+, ++, *+, {min, max}+Possessive quantifiers, semantic same as atomic groups

Disclaimer: All content above is based on the author's understanding of the reference books.

Reference book: "Mastering Regular Expressions" (by Jeffrey E.F. Friedl)

Review: This book is excellent in terms of chapter progression, emphasis on content, and even layout (unique layout: all thought experiments posed in the book require turning the page to see the answer). It is very helpful for a deep understanding of Regex; interested friends should check it out.

Comments

No comments yet. Be the first to share your thoughts.

Leave a comment