Major programming languages often include support for regular expressions, either directly in the language (as in Perl, whose dialect gave its name to the “Perl compatible regular expressions” family of RE languages) or as part of the standard library (like Java’s java.util.regex package). Other languages have one or even multiple regex packages available as add-on libraries.
In theory, a regular expression is some way of describing a regular language, i.e. a language which can be recognized by a finite automaton. (In practice, some regular expression implementations include features which can make the recognized language non-regular, like back references.) Important for regular expressions are:
- literals (i.e. an expression matching exactly one specific input character, like a in most RE implementations)
- repetition (the * or + operator) of arbitrary expressions
- alternatives (| between arbitrary expressions)
- grouping (( and )), which is used to make the syntax unambiguous while still allowing all these constructs to live together.
Most other features (like character classes) can be built from these. Often the groups can also be used for capturing parts of the result, and/or for reusing these results later in the same regular expression (which in fact makes the expression non-regular, as we need more than a finite automaton to implement this).
Here is an example expression: \b[0-9]*\.?\b[0-9]?(?!]). It matches a word boundary, zero or more digits, an optional dot, another word boundary, and an optional digit, but not when followed by a closing bracket. (It is said to match “decimal numbers” not followed by ], but it doesn’t really match all of them, and it also matches the empty string at most word boundaries when not followed by ].)
Lua is a scripting language with a focus on easy embedding in other programs. For this reason, it is quite small, smaller than most regular expression engines. Thus, naturally, it does not include such an engine.
It has its own replacement feature, simply named patterns in the manual.
These have three levels:
Character classes:
- literal ones (a matches a, %. matches .),
- some predefined ones (%a for any letter, . for any single character),
- sets ([a-z] represents any single one of the letters a to z).
Pattern items:
- a character class, optionally followed by one of the repetition modifiers ?, *, +, -
- a reused captured item (%3 matches the 3rd captured string)
- a balanced group (%b() matches a substring starting with (, ending with ), and with balanced parentheses between them)
Patterns: a sequence of
- pattern items
- ( )-enclosed patterns (these represent captures which can be used later in the pattern)
- empty captures () (they capture the current index into the string instead of a substring)
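To make the three levels a bit more concrete, here is a small sketch using the standard string.match function. The sample strings and variable names are made up for illustration only.

```lua
-- Sample input, chosen only for illustration.
local s = "key = (a (b) c) 42 xx"

-- Level 1: a character class matches exactly one character.
local letter = s:match("%a")        -- "k", the first letter
local digit  = s:match("[0-9]")     -- "4", the first character from the set

-- Level 2: pattern items.
local word     = s:match("%a+")     -- "key": class plus repetition modifier
local balanced = s:match("%b()")    -- "(a (b) c)": balanced group
local doubled  = s:match("(%a)%1")  -- "x": capture reused via %1

-- The "-" modifier is the lazy counterpart of "*".
local greedy = ("<a><b>"):match("<.*>")  -- "<a><b>"
local lazy   = ("<a><b>"):match("<.->")  -- "<a>"

-- Level 3: a pattern with a capture and an empty (position) capture.
local name, pos = s:match("(%a+)%s*=%s*()")  -- "key", 7

print(letter, digit, word, balanced, doubled, greedy, lazy, name, pos)
```

Note that every modifier above is attached to a single character class, never to a parenthesized group.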
Note that there are:
- no alternatives except for single characters/classes
- no repetitions except for single characters/classes
This makes it impossible to recognize the languages of some quite simple regular expressions:
ab|ba (either ab or ba - this is even a finite language)
(ab)* (a string consisting of any number of ab, but no aa nor bb.)
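Both languages can still be recognized in Lua, just not with a single pattern. The following is a minimal sketch of one possible workaround; the function names is_ab_or_ba and is_ab_star are hypothetical helpers, not anything provided by Lua.

```lua
-- ab|ba: there is no "|" in Lua patterns, so try each alternative in turn.
local function is_ab_or_ba(s)
  return s:find("^ab$") ~= nil or s:find("^ba$") ~= nil
end

-- (ab)*: delete every "ab" in one left-to-right pass; the string is in
-- the language exactly when nothing is left over.
local function is_ab_star(s)
  local rest = s:gsub("ab", "")  -- keep only the first return value
  return rest == ""
end

print(is_ab_or_ba("ba"), is_ab_star("abab"), is_ab_star("aabb"))
```

The gsub trick works here because a single pass removes only non-overlapping occurrences, so aabb leaves ab behind and is rejected.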
For our example language, we could create these patterns:
%d+%.?%d* matches all decimal numbers with at least one digit before the dot (or without a dot),
%d*%.?%d+ matches all decimal numbers with at least one digit after the dot (or without a dot),
%.%d+ matches decimal numbers without a digit before the dot.
Regular expressions would allow composing them to match what we want, but Lua patterns don’t (no alternatives). Thus the alternative has to be implemented externally, using Lua code.
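A minimal sketch of such an external alternative could look like this. The table alternatives and the function is_decimal are names I made up; the patterns are anchored with ^ and $ so that the whole string has to match, which is an assumption about what we want here.

```lua
-- Each entry is one alternative, anchored to the whole string.
local alternatives = {
  "^%d+%.?%d*$",  -- digits, optional dot, optional digits
  "^%.%d+$",      -- leading dot followed by digits
}

-- Returns true if any of the alternatives matches s.
local function is_decimal(s)
  for _, pattern in ipairs(alternatives) do
    if s:find(pattern) then
      return true
    end
  end
  return false
end

print(is_decimal("12.5"), is_decimal(".5"), is_decimal("."))
```

Two alternatives are enough in this sketch: any number with a digit after the dot either also has a digit before it (first pattern) or starts with the dot (second pattern).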
On the other hand, the %b feature allows recognizing languages which are not regular (if we ignore that this is most probably implemented with a finite-size int counter), like this simple one:
%b() - any ()-balanced string (with any other character interspersed in it).
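As a quick illustration (the sample strings and the helper is_balanced_group are my own, hypothetical choices), %b() can be used both to extract the first balanced group and to test whether a whole string is one:

```lua
-- Extract the first balanced (...) group from a longer string.
local text = "f(a(b)c) + g(x"
local first = text:match("%b()")  -- "(a(b)c)"

-- Check whether the whole string is a single balanced group.
local function is_balanced_group(s)
  return s:find("^%b()$") ~= nil
end

print(first, is_balanced_group("(a(b)c)"), is_balanced_group("((a)"))
```

Note that is_balanced_group rejects (a)(b): it is balanced, but it consists of two groups, and %b() matches exactly one.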
With pure RE (in the theoretical sense), this is impossible, since we need at least a pushdown automaton (or a counter automaton) for this. Perl’s RE implementation (in version 5.14) allows recursion, so there it would look like \(([^()]*(?R))*[^()]*\) (the (?R) part recurses to the whole pattern). Java does not support this.
Of course, often you will need to restrict the contents of your ()-balanced string a bit more
than only “must be balanced and can contain anything else”, and then the Lua patterns
won’t help you. Write a real parser (or use Perl’s recursive feature).
Likewise, the %1 feature (reuse the group captured by the first pair of parentheses) is not possible in pure RE, but most real-life RE implementations have it, in the form \1 in Perl-style dialects (including both Perl and Java).
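Here is a short sketch of %1 in action; the input strings are invented for the example.

```lua
-- Find the first doubled letter: the capture must be followed by itself.
local doubled = ("hello world"):match("(%a)%1")  -- "l"

-- Match a quoted string whose closing quote equals the opening one.
local quote, body = ([[say "it's" now]]):match([[(["'])(.-)%1]])
-- quote is the double quote character, body is: it's

print(doubled, quote, body)
```

The single quote inside the second string does not end the match, because %1 only accepts the same character that the first capture picked up.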
Another feature used in the example regular expression above is the \b zero-width assertion. It matches at a border between word characters (letters, digits and similar characters) and non-word-characters (like space, punctuation, …).
Lua seems to have nothing like this, at least judging from the manual alone.
But when I pointed this out in my answer, jpjacobs mentioned the frontier pattern in a comment.
This is a generalization of the regular expression \b: %f takes a set (in []) as argument, and matches where there is a transition from not in the set to in the set. For example, %f[%.%d] matches at the boundary between something that is not part of a decimal number and something that is.
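To see where a frontier actually matches, this small sketch collects the positions found by %f[%.%d] using an empty capture, and then uses two frontiers as a stand-in for \b around words. The sample strings are my own, and I am assuming a Lua version where %f is available (it is documented from Lua 5.2 on).

```lua
-- Positions where a "number-ish" run (digits or dots) starts.
local s = "a1.5 b.7"
local starts = {}
for pos in s:gmatch("()%f[%.%d]") do
  starts[#starts + 1] = pos
end
-- starts is {2, 7}: before "1.5" and before ".7"

-- Two frontiers emulate \b on both sides of a word.
local words = {}
for w in ("THE (quick) fox"):gmatch("%f[%a]%a+%f[%A]") do
  words[#words + 1] = w
end

print(table.concat(starts, ","), table.concat(words, ","))
```

The beginning and the end of the string count as a character that is not in %a, %d and similar classes, which is why the frontier also fires at the very start or end.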
With this, I could create the expression %f[%.%d]%d*%.?%d*%f[^%.%d%]], which matches all decimal numbers which are preceded by something that is neither digit nor dot (or by nothing), and followed by something that is neither ] nor digit nor dot (or by nothing). It also matches the single dot, though: there seems to be no way to allow both .1 and 1. without allowing either the single dot or strings with two dots, like .1. (a dot on both sides).
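A few sample inputs (chosen by me, not exhaustive) show how this pattern behaves, including the unwanted match on a lone dot:

```lua
local number = "%f[%.%d]%d*%.?%d*%f[^%.%d%]]"

local plain     = ("x 1.5 y"):match(number)  -- "1.5"
local bracketed = ("a[12]"):match(number)    -- nil: followed by "]"
local lone_dot  = ("a . b"):match(number)    -- ".": the unwanted case
local leading   = (".5"):match(number)       -- ".5"
local trailing  = ("5."):match(number)       -- "5."

print(plain, bracketed, lone_dot, leading, trailing)
```

If the lone dot is a problem, it can be filtered out afterwards in Lua code, for example by checking that the match contains at least one digit.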