Module Java.util.Regex

Support for working with regular expressions.

This exposes java.util.regex.Pattern as Regex and java.util.regex.MatchResult as MatchResult.

This module must have been imported during compilation, or certain language features cannot be used (see below).

Frege language support

Regex literals

Regular expression literals have type Regex and are written:

 ´\b(foo|bar)\b´       -- string enclosed in acute accents
 '\w+'                 -- string with length > 1 enclosed in apostrophes

The notation with apostrophes was introduced because many people have a hard time entering an acute accent mark on their terminal. However, a regular expression of length 1 cannot be written this way, because the literal would then be interpreted as a Char literal. (One can write something like '(?:X)' for a Regex that matches a single 'X'.)

The advantages of using literals rather than compiling string constants to patterns at runtime are:

1) They are checked by the Frege compiler, and illegal ones cause compilation to fail.

2) At runtime, literals are available as immutable values of type java.util.regex.Pattern, not as strings, as is so often seen in hand-written Java code. Thus the cost of repeated regex compilation is avoided.

3) The regular expression can be written directly, as in Perl or other languages, without taking String escape rules into account. That is, to match a $ sign, one writes

 ´\$´

and not

 "\\$"

like in Java.

The Regex values created from the literals have the following flags enabled: Regex.canon_eq, Regex.unicode_case and Regex.unicode_character_class. That is, they do their best - as far as Java supports it - to get processing of Unicode strings right.

Regex literals in pattern matching

A Regex literal is a valid pattern, like any literal:

 case expr1 of
   ´\w+´  →  expr2

This is equivalent to:

 case expr1 of
   s | s ~ ´\w+´ →  expr2

where s is some fresh variable that doesn't appear in expr2.

The match succeeds if the String value resulting from evaluation of expr1 matches the regular expression given. The match diverges when expr1 diverges. In all other cases, the match fails.

Note that in this simple form, the value of expr1 is not available in expr2. One can use an "at" pattern for this:

 case expr1 of
    string@'\w+'  →  expr2

which gets desugared to

 case expr1 of
   string | string ~ '\w+'  →  expr2

leaving string bound to the value of expr1.

The following syntax not only checks if the regular expression matches but also binds the MatchResult:

 case expr1 of
   m~'\w+'  →  expr2

It is equivalent to

 case expr1 of
   s | Just m ← s =~ '\w+'  →  expr2

where s is again a fresh variable not known in expr2, but when the match succeeds m is bound to the MatchResult and can be inspected further in expr2.

And finally, a pattern like

 s@m~'\w+'

works like the previous one but in addition binds s to the string value in question.

Imports

import Java.Lang as Lang
import Prelude.PreludeArrays as PreludeArrays
import Prelude.PreludeBase as PreludeBase
import Prelude.PreludeIO as PreludeIO
import Prelude.PreludeList as PreludeList
import Control.Semigroupoid as Semigroupoid

Table of Contents

Definitions

class Replace s

data MatchResult = pure native java.util.regex.MatchResult

data PatternSyntaxException = native java.util.regex.PatternSyntaxException

data Regex = pure native java.util.regex.Pattern

regcomp ∷ String → (PatternSyntaxException | Regex)
| String → Int → (PatternSyntaxException | Regex)

regforce ∷ String → Regex
| String → Int → Regex

regforce𝖆 ∷ String → Regex   overloads regforce

regforce𝖇 ∷ String → Int → Regex   overloads regforce

substituteFirst ∷ String → Regex → String → String

substituteAll ∷ String → Regex → String → String

=~ ∷ String → Regex → Maybe MatchResult

findAt ∷ String → Regex → Int → Maybe MatchResult

=~~ ∷ String → Regex → [Maybe String]

~ ∷ String → Regex → Bool

!~ ∷ String → Regex → Bool

~~ ∷ String → Regex → Maybe String

~~~ ∷ String → Regex → Int → Maybe String

~~* ∷ String → Regex → [String]

Instances

instance Exceptional PatternSyntaxException

instance Replace String

Functions and Values by Type

Definitions

data Regex = pure native java.util.regex.Pattern

Regex values are based on Java's java.util.regex.Pattern objects, which are compiled representations of regular expressions.

We call them Regex and not Pattern in Frege, because the word "pattern" and the phrase "pattern matching" have a quite different meaning in deconstruction of algebraic data types.

There are 2 ways to obtain a compiled regular expression:

1) Use a Regex literal, see the language reference section 2.7.3. All regular expression literals are values of type Regex. It is not possible to run a program that contains a syntactically invalid Regex literal, because the compiler flags it at compile time. Hence, with Regex literals one is immune to PatternSyntaxExceptions.

2) Compile a String with Regex.compile or regforce at run-time. An illegal regular expression results in a PatternSyntaxException: Regex.compile returns it in Either.Left, whereas regforce throws it, which, when not handled appropriately, may crash the program.
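The run-time failure mode of the second way can be sketched in plain Java, since Regex is backed by java.util.regex.Pattern. This is only an illustration of the underlying Java behavior, not the Frege implementation; the class SafeCompileDemo and its method tryCompile are hypothetical names, and Optional stands in for the Either result of Regex.compile.

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; Pattern and PatternSyntaxException are the real Java APIs.
class SafeCompileDemo {
    // Compiles at run-time and turns a syntax error into an empty Optional
    // instead of letting the exception propagate.
    static Optional<Pattern> tryCompile(String regex) {
        try {
            return Optional.of(Pattern.compile(regex));
        } catch (PatternSyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryCompile("foo|bar").isPresent()); // valid expression
        System.out.println(tryCompile("(foo").isPresent());    // unclosed group
    }
}
```

A Regex literal never reaches this code path, because the compiler has already rejected an invalid expression.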

Regex values are immutable.

Member Functions

canon_eq ∷ Int
pure native java.util.regex.Pattern.CANON_EQ

Enables canonical equivalence.

When this flag is specified then two characters will be considered to match if, and only if, their full canonical decompositions match. The expression ´a\u030A´, for example, will match the string "\u00E5" when this flag is specified.

There is no embedded flag character for enabling canonical equivalence.

Specifying this flag may impose a performance penalty.

Frege Regex literals will always be parsed with this flag.

case_insensitive ∷ Int
pure native java.util.regex.Pattern.CASE_INSENSITIVE

Enables case-insensitive matching.

By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the Regex.unicode_case flag in conjunction with this flag.

Case-insensitive matching can also be enabled via the embedded flag expression (?i).

Specifying this flag may impose a slight performance penalty.

comments ∷ Int
pure native java.util.regex.Pattern.COMMENTS

Permits whitespace and comments in pattern.

In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line.

Comments mode can also be enabled via the embedded flag expression (?x).
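A minimal Java sketch of comments mode as provided by the underlying java.util.regex.Pattern. The class name CommentsModeDemo and its helpers are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing Pattern.COMMENTS and the embedded flag (?x).
class CommentsModeDemo {
    // Whitespace and the trailing comment are not part of the expression.
    static final Pattern WITH_FLAG =
        Pattern.compile("a b   # matches 'ab', not 'a b'", Pattern.COMMENTS);

    // The same mode enabled from inside the expression.
    static final Pattern EMBEDDED = Pattern.compile("(?x) a b");

    static boolean matchesWithFlag(String s) {
        return WITH_FLAG.matcher(s).matches();
    }

    static boolean matchesEmbedded(String s) {
        return EMBEDDED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWithFlag("ab"));
        System.out.println(matchesWithFlag("a b"));
        System.out.println(matchesEmbedded("ab"));
    }
}
```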

compile ∷ String → (PatternSyntaxException | Regex)
| String → Int → (PatternSyntaxException | Regex)
pure native java.util.regex.Pattern.compile

Compiles a String to a Regex.

Returns either a PatternSyntaxException in Either.Left or a Regex in Either.Right.

The overloaded form has an additional Int argument which is a set of flags. Such a set can be constructed by adding up the individual flag values, for example:

  Regex.compile "foo|bar" (Regex.canon_eq + Regex.case_insensitive)
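Adding flag values works because each flag is a distinct bit, so the sum of different flags equals their bitwise OR. The following Java sketch shows this on the underlying Pattern constants. The class name FlagSetDemo is hypothetical, and the sketch assumes each flag is added at most once.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the constants and methods of Pattern are the real Java API.
class FlagSetDemo {
    // Sum and bitwise OR agree as long as no flag is added twice.
    static final int SUM = Pattern.CANON_EQ + Pattern.CASE_INSENSITIVE;
    static final int OR  = Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE;

    static final Pattern FOO_OR_BAR = Pattern.compile("foo|bar", OR);

    // Checks whether a single flag is contained in the compiled pattern's flag set.
    static boolean hasFlag(Pattern p, int flag) {
        return (p.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        System.out.println(SUM == OR);
        System.out.println(FOO_OR_BAR.matcher("FOO").matches());
        System.out.println(hasFlag(FOO_OR_BAR, Pattern.CASE_INSENSITIVE));
    }
}
```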

compile𝖆 ∷ String → (PatternSyntaxException | Regex)
pure native java.util.regex.Pattern.compile overloads compile

The one-argument overload of Regex.compile: compiles a String to a Regex without additional flags.

Returns either a PatternSyntaxException in Either.Left or a Regex in Either.Right.

compile𝖇 ∷ String → Int → (PatternSyntaxException | Regex)
pure native java.util.regex.Pattern.compile overloads compile

The two-argument overload of Regex.compile: compiles a String to a Regex using the set of flags given in the Int argument.

Returns either a PatternSyntaxException in Either.Left or a Regex in Either.Right.

The set of flags can be constructed by adding up the individual flag values, for example:

  Regex.compile "foo|bar" (Regex.canon_eq + Regex.case_insensitive)

dotall ∷ Int
pure native java.util.regex.Pattern.DOTALL

Enables dotall mode.

In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)
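A minimal Java sketch of the difference on the underlying Pattern class. The class name DotallDemo and its helper are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing default mode, Pattern.DOTALL and (?s).
class DotallDemo {
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "a\nb";
        // By default, '.' does not match the line terminator.
        System.out.println(matches("a.b", 0, input));
        // With the flag, '.' matches any character.
        System.out.println(matches("a.b", Pattern.DOTALL, input));
        // The embedded flag expression has the same effect.
        System.out.println(matches("(?s)a.b", 0, input));
    }
}
```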

flags ∷ Regex → Int
pure native flags

Returns the flags this pattern was compiled with.

literal ∷ Int
pure native java.util.regex.Pattern.LITERAL

Enables literal parsing of the pattern.

When this flag is specified then the input string that specifies the pattern is treated as a sequence of literal characters. Metacharacters or escape sequences in the input sequence will be given no special meaning.

The flags Regex.case_insensitive and Regex.unicode_case retain their impact on matching when used in conjunction with this flag. The other flags become superfluous.

There is no embedded flag character for enabling literal parsing.
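The effect of literal parsing, including its interaction with case-insensitive matching, can be sketched in Java on the underlying Pattern class. The class name LiteralFlagDemo and its helper are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for Pattern.LITERAL.
class LiteralFlagDemo {
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Without the flag, '.' is a metacharacter and matches 'x'.
        System.out.println(matches("a.b", 0, "axb"));
        // With the flag, '.' only matches a literal dot.
        System.out.println(matches("a.b", Pattern.LITERAL, "axb"));
        System.out.println(matches("a.b", Pattern.LITERAL, "a.b"));
        // Case-insensitive matching still applies to the literal text.
        System.out.println(matches("a.b", Pattern.LITERAL | Pattern.CASE_INSENSITIVE, "A.B"));
    }
}
```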

multiline ∷ Int
pure native java.util.regex.Pattern.MULTILINE

Enables multiline mode.

In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence.

Multiline mode can also be enabled via the embedded flag expression (?m).
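A minimal Java sketch of how ^ and $ behave with and without multiline mode on the underlying Pattern class. The class name MultilineDemo and its helper are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing default mode and Pattern.MULTILINE.
class MultilineDemo {
    // True if the expression matches somewhere in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "a\nb\nc";
        // By default, ^ and $ anchor to the whole input, so the middle line is not found.
        System.out.println(found("^b$", 0, input));
        // In multiline mode, ^ and $ also match at line boundaries.
        System.out.println(found("^b$", Pattern.MULTILINE, input));
        // The embedded flag expression has the same effect.
        System.out.println(found("(?m)^b$", 0, input));
    }
}
```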

pattern ∷ Regex → String
pure native pattern

Returns the source of this Regex.

split ∷ Regex → String → JArray String
pure native split

The java.util.regex.Pattern.split method.

As this returns a String array, one usually uses Regex.splitted instead.

splitted ∷ Regex → String → [String]

Split a string around matches of a regular expression and return the result as list of strings.

Empty strings at the start of the string are preserved, but not those at the end of the string, e.g.:

  ´,´.splitted ",,a,b,c,," == ["", "", "a", "b", "c"]
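The same treatment of leading and trailing empty strings can be observed on the underlying Pattern.split method. The Java sketch below converts the array to a list, which is roughly what the description of Regex.splitted suggests; the class name SplitDemo and the method splitToList are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; Pattern.split is the real Java method.
class SplitDemo {
    // Splits the input around matches of the expression and returns a list.
    static List<String> splitToList(String regex, String input) {
        return Arrays.asList(Pattern.compile(regex).split(input));
    }

    public static void main(String[] args) {
        // Leading empty strings are kept, trailing ones are dropped.
        System.out.println(splitToList(",", ",,a,b,c,,"));
    }
}
```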

toString ∷ Regex → String
pure native toString

Returns the string representation of this pattern.

This is the regular expression from which this pattern was compiled (see Regex.compile).

unicode_case ∷ Int
pure native java.util.regex.Pattern.UNICODE_CASE

Enables Unicode-aware case folding.

When this flag is specified then case-insensitive matching, when enabled by the Regex.case_insensitive flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.

Unicode-aware case folding can also be enabled via the embedded flag expression (?u).

Specifying this flag may impose a performance penalty.

Because Frege Regex literals will always be parsed with the Regex.unicode_character_class flag, this flag is also implicitly enabled by default.

It can be disabled with the embedded flag expression (?-u).
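A minimal Java sketch of the difference between ASCII-only and Unicode-aware case folding on the underlying Pattern class. The class name UnicodeCaseDemo and its helpers are hypothetical; the non-ASCII letters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing CASE_INSENSITIVE with and without UNICODE_CASE.
class UnicodeCaseDemo {
    static boolean asciiOnly(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(input).matches();
    }

    static boolean unicodeAware(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                      .matcher(input).matches();
    }

    public static void main(String[] args) {
        String regex = "\u00C4pfel"; // starts with capital A with diaeresis (U+00C4)
        String input = "\u00E4PFEL"; // starts with small a with diaeresis (U+00E4)
        // ASCII letters are folded, but the non-ASCII first letter is not.
        System.out.println(asciiOnly(regex, input));
        // With UNICODE_CASE the first letter is folded as well.
        System.out.println(unicodeAware(regex, input));
    }
}
```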

unicode_character_class ∷ Int
pure native java.util.regex.Pattern.UNICODE_CHARACTER_CLASS

Enables the Unicode version of Predefined character classes and POSIX character classes.

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.

The Regex.unicode_character_class mode can also be enabled via the embedded flag expression (?U).

The flag implies Regex.unicode_case, that is, it enables Unicode-aware case folding.

Specifying this flag may impose a performance penalty.

Frege Regex literals will always be parsed with this flag. It can be disabled with the embedded flag expression (?-U).
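The effect on a predefined character class such as \w can be sketched in Java on the underlying Pattern class. The class name UnicodeClassDemo and its helper are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for Pattern.UNICODE_CHARACTER_CLASS and (?-U).
class UnicodeClassDemo {
    static boolean isWord(int flags, String regex, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "\u00FCber"; // starts with small u with diaeresis (U+00FC)
        // Without the flag, \w covers only [a-zA-Z_0-9].
        System.out.println(isWord(0, "\\w+", input));
        // With the flag, \w follows the Unicode definition of a word character.
        System.out.println(isWord(Pattern.UNICODE_CHARACTER_CLASS, "\\w+", input));
        // The embedded flag expression (?-U) switches the mode off again.
        System.out.println(isWord(Pattern.UNICODE_CHARACTER_CLASS, "(?-U)\\w+", input));
    }
}
```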

unix_lines ∷ Int
pure native java.util.regex.Pattern.UNIX_LINES

Enables UNIX lines mode.

In this mode, only the \n line terminator is recognized in the behavior of ., ^, and $.

Unix lines mode can also be enabled via the embedded flag expression (?d).

data PatternSyntaxException = native java.util.regex.PatternSyntaxException

Signals a syntax error in a regular expression that is compiled from a String at run-time. Regex.compile returns it in Either.Left, whereas regforce throws it.

regcomp ∷ String → (PatternSyntaxException | Regex)
| String → Int → (PatternSyntaxException | Regex)
pure native java.util.regex.Pattern.compile

Alias for Regex.compile

regforce ∷ String → Regex
| String → Int → Regex
pure native java.util.regex.Pattern.compile

Compiles a String to a Regex.

This will throw PatternSyntaxException when the regular expression is illegal.

Use this only if you're sure that pattern compilation will not throw an exception, or if you don't care.

regforce𝖆 ∷ String → Regex
pure native java.util.regex.Pattern.compile overloads regforce

The one-argument overload of regforce: compiles a String to a Regex without additional flags.

This will throw PatternSyntaxException when the regular expression is illegal. See regforce for when its use is appropriate.

regforce𝖇 ∷ String → Int → Regex
pure native java.util.regex.Pattern.compile overloads regforce

The two-argument overload of regforce: compiles a String to a Regex using the set of flags given in the Int argument.

This will throw PatternSyntaxException when the regular expression is illegal. See regforce for when its use is appropriate.

class Replace s

This is only used to inject the replace functions into String.

Known Instances

StringJ

Member Functions

replaceAll ∷ Replace s ⇒ s → Regex → s → s

Like Replace.replaceFirst, but replaces all substrings that match the pattern.

replaceFirst ∷ Replace s ⇒ s → Regex → s → s

Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.

This method scans the input sequence from the start looking for a match of the pattern. Characters that are not part of the match are appended directly to the result string; the match is replaced in the result by the replacement string. The replacement string may contain references to captured subsequences.

Note that backslashes (\\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.

For example

 replaceFirst "zzzdogzzzdogzzz" ´dog´ "cat" == "zzzcatzzzdogzzz"
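The special meaning of dollar signs and backslashes in the replacement string comes from the underlying java.util.regex.Matcher methods. The following Java sketch shows a group reference and, as one possible way to get a purely literal replacement, Matcher.quoteReplacement. The class name ReplacementDemo and its helpers are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; replaceFirst, replaceAll and quoteReplacement are real Matcher methods.
class ReplacementDemo {
    static String first(String input, String regex, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
    }

    static String all(String input, String regex, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    // Replaces all matches with the replacement taken literally,
    // so that '$' and '\' lose their special meaning.
    static String allLiteral(String input, String regex, String replacement) {
        return all(input, regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(first("zzzdogzzzdogzzz", "dog", "cat"));
        System.out.println(all("zzzdogzzzdogzzz", "dog", "cat"));
        // $1 and $2 refer to the captured groups.
        System.out.println(first("user@host", "(\\w+)@(\\w+)", "$2@$1"));
        // Quoted, "$1" is inserted as plain text.
        System.out.println(allLiteral("a+b", "\\+", "$1"));
    }
}
```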

substituteFirst ∷ String → Regex → String → String
pure native frege.runtime.Regex9.replaceFirst

Alias for Replace_String.replaceFirst

substituteAll ∷ String → Regex → String → String
pure native frege.runtime.Regex9.replaceAll

Alias for Replace_String.replaceAll

data MatchResult = pure native java.util.regex.MatchResult

This interface contains query methods used to determine the results of a match against a regular expression.

For more, see the JavaDoc of java.util.regex.MatchResult.

Member Functions

end ∷ MatchResult → Int → Int
pure native end

Returns the offset after the last character of the subsequence captured by the given group during the previous match operation where group 0 denotes the entire pattern.

If the specified capturing group failed to match, the return value will be -1.

The following property holds:

 (m.group n == Nothing) ==> (m.end n < 0)

group ∷ MatchResult → Int → Maybe String

 group m n

retrieves the input subsequence captured by the given group index during the previous match operation.

Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression (m.group 0) retrieves that portion of the input string that was matched by the pattern.

If the match was successful but the group specified failed to match any part of the input sequence, then Maybe.Nothing is returned.

Note that some groups, for example (a?), match the empty string. This function will return Just "" when such a group successfully matches the empty string in the input.

The result will also be Maybe.Nothing when the specified group does not exist, that is, when the Int argument is outside the range [0..MatchResult.groupCount m]

The following property holds for a MatchResult m with input sequence s and valid group index g:

 isJust (m.group g) ==> (m.group g) == Just (s.substr (m.start g) (m.end g))
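The relation between group, start and end can be checked against the underlying java.util.regex.MatchResult interface. In the Java sketch below a group that did not take part in the match yields null and offset -1, which corresponds to the Maybe.Nothing case described above. The class name GroupDemo and the method firstMatch are hypothetical.

```java
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; MatchResult, Matcher and Pattern are the real Java APIs.
class GroupDemo {
    // Returns the result of the first match, or null when there is none.
    static MatchResult firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.toMatchResult() : null;
    }

    public static void main(String[] args) {
        String s = "xxb";
        MatchResult m = firstMatch("(a)|(b)", s);
        // Group 1 did not take part in the match: null, with offsets of -1.
        System.out.println(m.group(1) + " " + m.start(1) + " " + m.end(1));
        // Group 2 matched: the group equals the substring between start and end.
        System.out.println(m.group(2).equals(s.substring(m.start(2), m.end(2))));
        // A group such as (a?) can match the empty string, which is not null.
        System.out.println(firstMatch("(a?)b", "b").group(1).isEmpty());
    }
}
```

Note that in Java an index outside the range of 0 to groupCount() makes group throw an IndexOutOfBoundsException, whereas the text above states that MatchResult.group returns Maybe.Nothing in that case.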

groupCount ∷ MatchResult → Int
pure native groupCount

Returns the number of capturing groups in this matcher's pattern.

Group zero denotes the entire pattern by convention. It is not included in this count.

Any non-negative integer smaller than or equal to the value returned by this method is guaranteed to be a valid group index for this matcher.

groupPrim ∷ MatchResult → Int → (IndexOutOfBoundsException | Maybe String)
pure native group

groups ∷ Maybe MatchResult → [Maybe String]

The list of subsequences captured during the previous match operation, or the empty list if there was no match.

To be used like:

 groups $ s =~ regex

Instances

instance Exceptional PatternSyntaxException

Member Functions

javaClass ∷ Class PatternSyntaxException
pure native java.util.regex.PatternSyntaxException.class

instance Replace String

Member Functions

replaceAll ∷ String → Regex → String → String
pure native frege.runtime.Regex9.replaceAll

Like String.replaceFirst, but replaces all substrings that match the pattern.

replaceFirst ∷ String → Regex → String → String
pure native frege.runtime.Regex9.replaceFirst

Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.

For example

 "zzzdogzzzdogzzz".replaceFirst  ´dog´ "cat" == "zzzcatzzzdogzzz"

Functions and Values by Type

Maybe MatchResult → [Maybe String]: MatchResult.groups
String → Regex → String → String: Replace_String.replaceFirst, Replace_String.replaceAll
String → Regex → Int → Maybe String: ~~~
String → Regex → Int → Maybe MatchResult: findAt
String → Regex → Maybe String: ~~
String → Regex → Maybe MatchResult: =~
String → Regex → [Maybe String]: =~~
String → Regex → [String]: ~~*
String → Regex → Bool: !~, ~
String → Int → (PatternSyntaxException | Regex): Regex.compile𝖇
String → Int → Regex: regforce𝖇
String → (PatternSyntaxException | Regex): Regex.compile𝖆
String → Regex: regforce𝖆
MatchResult → Int → (IndexOutOfBoundsException | Maybe String): MatchResult.groupPrim
MatchResult → Int → Maybe String: MatchResult.group
MatchResult → Int → Int: MatchResult.start, MatchResult.end
MatchResult → String: MatchResult.match
MatchResult → Int: MatchResult.groupCount
Regex → String → JArray String: Regex.split
Regex → String → [String]: Regex.splitted
Regex → String: Regex.toString, Regex.pattern
Regex → Int: Regex.flags
Class PatternSyntaxException: Exceptional_PatternSyntaxException.javaClass
Int: Regex.unix_lines, Regex.unicode_case, Regex.unicode_character_class, Regex.multiline, Regex.dotall, Regex.literal, Regex.comments, Regex.canon_eq, Regex.case_insensitive
Replace s ⇒ s → Regex → s → s: Replace.replaceFirst, Replace.replaceAll
ω: regforce, Regex.compile