Support for working with regular expressions.
This exposes java.util.regex.Pattern as Regex and java.util.regex.MatchResult as MatchResult.
This module must be imported during compilation, otherwise certain language features cannot be used (see below).
Regular expression literals have type Regex and are written:
´\b(foo|bar)\b´ -- string enclosed in grave accents
'\w+' -- string with length > 1 enclosed in apostrophes
The notation with apostrophes was introduced because many people have a hard time entering a grave accent mark on their terminal. However, it is not possible to write a regular expression of length 1 this way, because the literal would then be interpreted as a Char literal. (One can write something like '(?:X)' for a Regex that matches a single 'X'.)
The advantages of using literals rather than compiling string constants to patterns at runtime are:
1) They are checked by the Frege compiler, and illegal ones cause compilation to fail.
2) At runtime, literals are available as immutable values of type java.util.regex.Pattern, not as strings, as is so often seen in hand-written Java code. Thus the cost of repeated regex compilation is avoided.
3) The regular expression can be written directly, like in Perl or other languages, without taking String escape rules into account. That is, to match a $ sign, one writes
´\$´
and not
"\\$"
like in Java.
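For comparison, here is a minimal sketch of the Java side of this point. The class and method names are hypothetical; only java.util.regex.Pattern is a real API. It shows that the regular expression \$ has to be written with a doubled backslash inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class DollarEscapeDemo {
    // The regex \$ must be spelled "\\$" in a Java string literal,
    // because the string escape rules consume one backslash.
    static boolean containsDollar(String input) {
        return Pattern.compile("\\$").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsDollar("costs $5")); // true
        System.out.println(containsDollar("costs 5"));  // false
    }
}
```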
The Regex values created from the literals have the following flags enabled: Regex.canon_eq, Regex.unicode_case and Regex.unicode_character_class. That is, they do their best - as far as Java supports it - to get processing of Unicode strings right.
A Regex literal is a valid pattern, like any literal:
case expr1 of ´\w+´ → expr2
This is equivalent to:
case expr1 of s | s ~ ´\w+´ → expr2
where s is some fresh variable that doesn't appear in expr2.
The match succeeds if the String value resulting from evaluation of expr1 matches the regular expression given. The match diverges when expr1 diverges. In all other cases, the match fails.
Note that in this simple form, the value of expr1 is not available in expr2. One can use an "at" pattern for this:
case expr1 of string@'\w+' → expr2
which gets desugared to
case expr1 of string | string ~ '\w+' → expr2
leaving string bound to the value of expr1.
The following syntax not only checks if the regular expression matches but also binds the MatchResult:
case expr1 of m~'\w+' → expr2
It is equivalent to
case expr1 of s | Just m ← s =~ '\w+' → expr2
where s is again a fresh variable not known in expr2, but when the match succeeds m is bound to the MatchResult and can be inspected further in expr2.
And finally, a pattern like
s@m~'\w+'
works like the previous one but in addition binds s to the string value in question.
Regex values are based on Java's java.util.regex.Pattern objects, which are compiled representations of regular expressions.
We call them Regex and not Pattern in Frege, because the word "pattern" and the phrase "pattern matching" have a quite different meaning in deconstruction of algebraic data types.
There are 2 ways to obtain a compiled regular expression: writing a Regex literal, or compiling a String at runtime with Regex.compile or regforce.
Regex values are immutable.
Enables canonical equivalence.
When this flag is specified then two characters will be considered to match if, and only if, their full canonical decompositions match. The expression ´a\u030A´, for example, will match the string "\u00E5" when this flag is specified.
There is no embedded flag character for enabling canonical equivalence.
Specifying this flag may impose a performance penalty.
Frege Regex literals will always be parsed with this flag.
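The effect can be sketched in plain Java, assuming Regex.canon_eq corresponds to Pattern.CANON_EQ (the class name below is hypothetical). The pattern is 'a' followed by the combining ring above (U+030A); the input is the single precomposed character U+00E5.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class CanonEqDemo {
    // Matches the decomposed pattern against the precomposed input
    // using the given flag set.
    static boolean matches(int flags) {
        return Pattern.compile("a\u030A", flags).matcher("\u00E5").matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(Pattern.CANON_EQ)); // true
        System.out.println(matches(0));                // false
    }
}
```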
Enables case-insensitive matching.
By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the Regex.unicode_case flag in conjunction with this flag.
Case-insensitive matching can also be enabled via the embedded flag expression (?i).
Specifying this flag may impose a slight performance penalty.
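A small Java sketch of the difference between ASCII-only and Unicode-aware case-insensitive matching, assuming the Frege flags map to Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE. The class name is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class CaseInsensitiveDemo {
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int ci = Pattern.CASE_INSENSITIVE;
        // ASCII letters are folded by CASE_INSENSITIVE alone.
        System.out.println(matches("frege", "FREGE", ci));                       // true
        // U+00E9 and U+00C9 (e with acute, lower and upper case) are not ASCII.
        System.out.println(matches("\u00e9", "\u00c9", ci));                     // false
        System.out.println(matches("\u00e9", "\u00c9", ci | Pattern.UNICODE_CASE)); // true
        // The embedded flag (?i) works without passing any flag value.
        System.out.println(matches("(?i)frege", "FREGE", 0));                    // true
    }
}
```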
Permits whitespace and comments in pattern.
In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line.
Comments mode can also be enabled via the embedded flag expression (?x).
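As an illustration in plain Java, assuming Regex.comments corresponds to Pattern.COMMENTS (the class name and the sample pattern are made up for this sketch):

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class CommentsDemo {
    // The spaces and the trailing "# ..." text are ignored in comments mode.
    static final String REGEX = "(\\d{4}) - (\\d{2})   # year and month";

    static boolean matches(String input, int flags) {
        return Pattern.compile(REGEX, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("2014-02", Pattern.COMMENTS)); // true
        // Without the flag, spaces and '#' are ordinary pattern characters.
        System.out.println(matches("2014-02", 0));                // false
    }
}
```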
Returns either a PatternSyntaxException in Either.Left or a Regex in Either.Right.
The overloaded form has an additional Int argument which is a set of flags. Such a set can be constructed by adding up the individual flag values, for example:
Regex.compile "foo|bar" (Regex.canon_eq + Regex.case_insensitive)
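The underlying Java call throws PatternSyntaxException instead of returning an Either. A rough analogue can be sketched with Optional; the helper tryCompile is hypothetical and not part of any library. In Java the flag set is usually built with bitwise OR, which gives the same result as adding distinct flag values.

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, for illustration only.
class CompileDemo {
    // Returns the compiled pattern, or an empty Optional if the regex is illegal.
    static Optional<Pattern> tryCompile(String regex, int flags) {
        try {
            return Optional.of(Pattern.compile(regex, flags));
        } catch (PatternSyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        int flags = Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE;
        System.out.println(tryCompile("foo|bar", flags).isPresent()); // true
        System.out.println(tryCompile("(unclosed", 0).isPresent());   // false
    }
}
```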
Enables dotall mode.
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)
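A minimal Java sketch of dotall mode, assuming Regex.dotall corresponds to Pattern.DOTALL (hypothetical class name):

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class DotallDemo {
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("a.b", "a\nb", 0));              // false
        System.out.println(matches("a.b", "a\nb", Pattern.DOTALL)); // true
        System.out.println(matches("(?s)a.b", "a\nb", 0));          // true
    }
}
```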
Returns the flags this pattern was compiled with.
Enables literal parsing of the pattern.
When this flag is specified then the input string that specifies the pattern is treated as a sequence of literal characters. Metacharacters or escape sequences in the input sequence will be given no special meaning.
The flags Regex.case_insensitive and Regex.unicode_case retain their impact on matching when used in conjunction with this flag. The other flags become superfluous.
There is no embedded flag character for enabling literal parsing.
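In plain Java, and assuming Regex.literal corresponds to Pattern.LITERAL, the behaviour looks roughly like this (hypothetical class name):

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class LiteralDemo {
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // With LITERAL the dot is an ordinary character, not a wildcard.
        System.out.println(matches("a.b", "axb", Pattern.LITERAL)); // false
        System.out.println(matches("a.b", "a.b", Pattern.LITERAL)); // true
        // CASE_INSENSITIVE still takes effect together with LITERAL.
        System.out.println(matches("a.b", "A.B",
                Pattern.LITERAL | Pattern.CASE_INSENSITIVE));       // true
    }
}
```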
Enables multiline mode.
In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence.
Multiline mode can also be enabled via the embedded flag expression (?m).
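The difference can be sketched in Java by counting matches of ^\w+$ in a two-line input, assuming Regex.multiline corresponds to Pattern.MULTILINE (hypothetical class name):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class MultilineDemo {
    // Counts how often the whole-line pattern ^\w+$ is found in the input.
    static int countLines(String regex, String input, int flags) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countLines("^\\w+$", "one\ntwo", 0));                 // 0
        System.out.println(countLines("^\\w+$", "one\ntwo", Pattern.MULTILINE)); // 2
        System.out.println(countLines("(?m)^\\w+$", "one\ntwo", 0));             // 2
    }
}
```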
Returns the source of this Regex.
The java.util.regex.Pattern.split method.
As this returns a String array, one usually uses Regex.splitted instead.
Split a string around matches of a regular expression and return the result as list of strings.
Empty strings at the start of the string are preserved, but not those at the end of the string, e.g.:
´,´.splitted ",,a,b,c,," == ["", "", "a", "b", "c"]
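The same treatment of leading and trailing empty strings can be observed with Pattern.split in plain Java. This sketch only shows the Java behaviour; the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class SplitDemo {
    // Splits the input around matches of the regex and returns a list.
    static List<String> split(String regex, String input) {
        return Arrays.asList(Pattern.compile(regex).split(input));
    }

    public static void main(String[] args) {
        // Leading empty strings are kept, trailing ones are dropped.
        System.out.println(split(",", ",,a,b,c,,")); // [, , a, b, c]
    }
}
```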
Returns the string representation of this pattern.
This is the regular expression from which this pattern was compiled (see Regex.compile).
Enables Unicode-aware case folding.
When this flag is specified then case-insensitive matching, when enabled by the Regex.case_insensitive flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.
Unicode-aware case folding can also be enabled via the embedded flag expression (?u).
Specifying this flag may impose a performance penalty.
Because Frege Regex literals will always be parsed with the Regex.unicode_character_class flag, this flag is also implicitly enabled by default.
It can be disabled with the embedded flag expression (?-u).
Enables the Unicode version of Predefined character classes and POSIX character classes.
When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.
The Regex.unicode_character_class mode can also be enabled via the embedded flag expression (?U).
The flag implies Regex.unicode_case, that is, it enables Unicode-aware case folding.
Specifying this flag may impose a performance penalty.
Frege Regex literals will always be parsed with this flag. It can be disabled with the embedded flag expression (?-U).
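A short Java sketch of what the flag changes for \w, assuming Regex.unicode_character_class corresponds to Pattern.UNICODE_CHARACTER_CLASS. The class name is hypothetical; the input is the word "été" written with escapes.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class UnicodeClassDemo {
    static boolean isWord(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = "\u00e9t\u00e9";
        // By default \w covers only [a-zA-Z_0-9].
        System.out.println(isWord("\\w+", word, 0));                               // false
        System.out.println(isWord("\\w+", word, Pattern.UNICODE_CHARACTER_CLASS)); // true
        System.out.println(isWord("(?U)\\w+", word, 0));                           // true
    }
}
```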
Enables UNIX lines mode.
In this mode, only the \n line terminator is recognized in the behavior of ., ^, and $.
Unix lines mode can also be enabled via the embedded flag expression (?d).
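For illustration in plain Java, assuming Regex.unix_lines corresponds to Pattern.UNIX_LINES (hypothetical class name): a carriage return stops . by default, but not in UNIX lines mode.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class UnixLinesDemo {
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // By default '\r' counts as a line terminator, so '.' does not match it.
        System.out.println(matches(".+", "a\rb", 0));                  // false
        System.out.println(matches(".+", "a\rb", Pattern.UNIX_LINES)); // true
        // '\n' remains a line terminator in UNIX lines mode.
        System.out.println(matches(".+", "a\nb", Pattern.UNIX_LINES)); // false
    }
}
```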
Thrown when compiling a String to a Regex with Regex.compile or regforce fails.
Alias for Regex.compile
This will throw PatternSyntaxException when the regular expression is illegal.
Use this only if you're sure that pattern compilation will not throw an exception, or if you don't care.
This is only used to inject the replace functions into String.
Like Replace.replaceFirst, but replaces all substrings that match the pattern.
Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.
This method scans the input sequence from the start looking for a match of the pattern. Characters that are not part of the match are appended directly to the result string; the match is replaced in the result by the replacement string. The replacement string may contain references to captured subsequences.
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
For example
replaceFirst "zzzdogzzzdogzzz" ´dog´ "cat" == "zzzcatzzzdogzzz"
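The role of dollar signs in the replacement string can be sketched with the Java Matcher methods on which this description is based. The class name and sample inputs are hypothetical. Matcher.quoteReplacement is the standard Java way to make a replacement string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class ReplaceDemo {
    static String first(String input, String regex, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
    }

    static String all(String input, String regex, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(first("zzzdogzzzdogzzz", "dog", "cat")); // zzzcatzzzdogzzz
        System.out.println(all("zzzdogzzzdogzzz", "dog", "cat"));   // zzzcatzzzcatzzz
        // $1 and $2 refer to the captured groups.
        System.out.println(first("user@host", "(\\w+)@(\\w+)", "$2:$1")); // host:user
        // quoteReplacement turns "$5" into a literal replacement.
        System.out.println(first("price", "price", Matcher.quoteReplacement("$5"))); // $5
    }
}
```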
Alias for Replace_String.replaceFirst
Alias for Replace_String.replaceAll
This interface contains query methods used to determine the results of a match against a regular expression.
More: see the JavaDoc for java.util.regex.MatchResult.
Returns the offset after the last character of the subsequence captured by the given group during the previous match operation where group 0 denotes the entire pattern.
If the specified capturing group failed to match, the return value will be -1.
The following property holds:
(m.group n == Nothing) ==> (m.end n < 0)
group m n
retrieves the input subsequence captured by the given group index during the previous match operation.
Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression (m.group 0) retrieves that portion of the input string that was matched by the pattern.
If the match was successful but the group specified failed to match any part of the input sequence, then Maybe.Nothing is returned.
Note that some groups, for example (a?), match the empty string. This function will return Just "" when such a group successfully matches the empty string in the input.
The result will also be Maybe.Nothing when the specified group does not exist, that is, when the Int argument is outside the range [0..MatchResult.groupCount m].
The following property holds for a MatchResult m with input sequence s and valid group index g:
isJust (m.group g) ==> (m.group g) == Just (s.substr (m.start g) (m.end g))
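The relation between group, start and end can be sketched against the underlying Java MatchResult. The class name and sample regex are hypothetical. Note one difference in plain Java: a group that did not participate yields null rather than Nothing, and a nonexistent group index raises an exception.

```java
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class GroupDemo {
    // Returns a snapshot of the first match of regex in input.
    static MatchResult firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            throw new IllegalArgumentException("no match");
        }
        return m.toMatchResult();
    }

    public static void main(String[] args) {
        String s = "xb";
        MatchResult r = firstMatch("(a)|(b)", s);
        System.out.println(r.group(1));                       // null, group 1 did not match
        System.out.println(r.start(1) + " " + r.end(1));      // -1 -1
        System.out.println(r.group(2));                       // b
        System.out.println(s.substring(r.start(2), r.end(2))); // b
    }
}
```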
Returns the number of capturing groups in this matcher's pattern.
Group zero denotes the entire pattern by convention. It is not included in this count.
Any non-negative integer smaller than or equal to the value returned by this method is guaranteed to be a valid group index for this matcher.
The list of subsequences captured during the previous match operation, or the empty list if there was no match.
To be used like:
groups $ s =~ regex
See also =~~
MatchResult.match m returns the input subsequence matched by the previous match. The result is undefined if the last match was not successful.
For a MatchResult m with input sequence s, the following holds:
isJust (m.group 0) ==> unJust (m.group 0) == m.match
Note that some patterns, for example a?, match the empty string. This method will return the empty string when the pattern successfully matches the empty string in the input.
Returns the start index of the subsequence captured by the given group during the previous match operation where group 0 denotes the entire pattern.
If the specified capturing group failed to match, the return value will be -1.
The following property holds:
(m.group n == Nothing) ==> (m.start n < 0)
string =~ regex
tries to match string against regex and returns Just matcher if it succeeds, Nothing otherwise.
findAt string regex offset
Like =~, but matches only the substring of string that starts at offset against regex.
If offset is negative or not smaller than the length of string, the result is Nothing.
string =~~ regex
tries to match string against regex and returns a list of all captured substrings, see MatchResult.groups.
Example:
case "today is 2014-02-18" =~~ ´(\d\d\d\d)-(0[1-9]|1[012])-(\d\d)´ of
    [Just date, Just year, Just month, Just day] -> ...
    baddate -> ...
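For comparison, collecting the same captured substrings with the underlying Java API might look like the sketch below. The helper groups is hypothetical. In plain Java, a group that did not participate would appear as null in the list rather than as Nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class DateGroupsDemo {
    // Returns group 0 and all capturing groups of the first match,
    // or an empty list if there is no match.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String regex = "(\\d\\d\\d\\d)-(0[1-9]|1[012])-(\\d\\d)";
        System.out.println(groups(regex, "today is 2014-02-18")); // [2014-02-18, 2014, 02, 18]
        System.out.println(groups(regex, "no date here"));        // []
    }
}
```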
string ~ regex
true if string matches regex, false otherwise
s !~ p == !(s ~ p)
("string" ~~ ´r..´) == Just "rin"
Tries a match and returns Just x where x is the matched substring or Nothing if there was no match.
string ~~~ regex
Matches string with regex and returns a function that can be used to extract the matched part of the string and the captured substrings.
let f = "frege" ~~~ ´(..).(..)´ in [ f i | i <- 0..3 ]
yields
[Just "frege", Just "fr", Just "ge", Nothing]
"cats and dogs are not concatenated." ~~* ´cat|dog´
Matches the string with the regex and returns a list of all matches. The example gives:
["cat", "dog", "cat"]
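A minimal Java sketch of collecting all matches with repeated calls to Matcher.find. The helper allMatches is hypothetical and only illustrates the underlying mechanism, not how the Frege operator is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class AllMatchesDemo {
    // Returns every non-overlapping match of regex in input, left to right.
    static List<String> allMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(
            allMatches("cat|dog", "cats and dogs are not concatenated.")); // [cat, dog, cat]
    }
}
```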
Like String.replaceFirst, but replaces all substrings that match the pattern.
Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.
This method scans the input sequence from the start looking for a match of the pattern. Characters that are not part of the match are appended directly to the result string; the match is replaced in the result by the replacement string. The replacement string may contain references to captured subsequences.
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
For example
"zzzdogzzzdogzzz".replaceFirst ´dog´ "cat" == "zzzcatzzzdogzzz"
Regex.unix_lines, Regex.unicode_case, Regex.unicode_character_class, Regex.multiline, Regex.dotall, Regex.literal, Regex.comments, Regex.canon_eq, Regex.case_insensitive