Friday, October 8, 2010

Validation Expressions

Metacharacter Match
\ the escape character - used to find an instance of a metacharacter like a period, brackets, etc.
. (period) match any character except newline
x match any instance of x
[^x] match any character except x (a ^ placed first inside brackets negates the class)
[x] match any instance of x in the bracketed range - [abxyz] will match any instance of a, b, x, y, or z
| (pipe) an OR operator - x|y will match an instance of x or y (inside brackets the pipe is a literal character)
() used to group sequences of characters or matches
{} used to define numeric quantifiers
{x} match must occur exactly x times
{x,} match must occur at least x times
{x,y} match must occur at least x times, but no more than y times
? preceding match is optional (zero or one occurrence), same as {0,1}
* find 0 or more of preceding match, same as {0,}
+ find 1 or more of preceding match, same as {1,}
^ match the beginning of the line
$ match the end of a line
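
The quantifiers, negated class, alternation and anchors listed above can be tried out with a short sketch. It assumes Python's built-in re module; the helper name full is made up for illustration, and other regex engines may differ in small details.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Quantifiers: {x}, {x,}, {x,y} and ?
print(full(r"a{3}", "aaa"))       # True: exactly three
print(full(r"a{2,}", "a"))        # False: at least two are required
print(full(r"a{1,2}", "aaa"))     # False: no more than two are allowed
print(full(r"colou?r", "color"))  # True: the u is optional

# Escaping: an unescaped dot is a wildcard, an escaped dot is a literal dot
print(full(r"a.c", "abc"))        # True
print(full(r"a\.c", "abc"))       # False

# Negated class and alternation
print(full(r"[^x]", "y"))         # True: any single character except x
print(full(r"cat|dog", "dog"))    # True: either side of the pipe

# Anchors: ^ ties the match to the start, $ to the end
print(re.search(r"^abc", "abcdef") is not None)  # True
print(re.search(r"^def", "abcdef") is not None)  # False
print(re.search(r"def$", "abcdef") is not None)  # True
```
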
POSIX Class Match
[:alnum:] alphabetic and numeric characters
[:alpha:] alphabetic characters
[:blank:] space and tab
[:cntrl:] control characters
[:digit:] digits
[:graph:] visible characters (printable characters, excluding space)
[:lower:] lowercase alphabetic characters
[:print:] any printable characters, including space
[:punct:] punctuation characters
[:space:] all whitespace characters (includes [:blank:], newline, carriage return)
[:upper:] uppercase alphabetic characters
[:xdigit:] digits allowed in a hexadecimal number (i.e. 0-9, a-f, A-F)
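
POSIX class names are written inside a bracket expression (for example [[:digit:]]) in engines that support them. Python's built-in re module does not recognise these names, so the sketch below spells out approximate ASCII-only equivalents. The table POSIX_ASCII and the helper is_all are hypothetical names introduced here for illustration, not part of any library.

```python
import re
import string

# Hypothetical lookup table: POSIX class name -> ASCII-only bracket expression.
POSIX_ASCII = {
    "alnum": r"[A-Za-z0-9]",
    "alpha": r"[A-Za-z]",
    "blank": r"[ \t]",
    "digit": r"[0-9]",
    "lower": r"[a-z]",
    "upper": r"[A-Z]",
    "space": r"[ \t\n\r\f\v]",
    "xdigit": r"[0-9A-Fa-f]",
    # string.punctuation holds the ASCII punctuation characters;
    # re.escape makes each of them safe to use inside the brackets.
    "punct": "[" + re.escape(string.punctuation) + "]",
}

def is_all(posix_name: str, text: str) -> bool:
    """Return True when every character of text belongs to the named class."""
    return re.fullmatch(POSIX_ASCII[posix_name] + "+", text) is not None

print(is_all("xdigit", "DEADbeef42"))  # True
print(is_all("xdigit", "xyz"))         # False
print(is_all("blank", " \t"))          # True
print(is_all("blank", "\n"))           # False: newline is [:space:], not [:blank:]
```
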

Character class Match
\d matches a digit, same as [0-9]
\D matches a non-digit, same as [^0-9]
\s matches a whitespace character (space, tab, newline, etc.)
\S matches a non-whitespace character
\w matches a word character (letter, digit, or underscore)
\W matches a non-word character
\b matches a word-boundary (NOTE: within a class, matches a backspace)
\B matches a non-word-boundary
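
As a rough illustration of the shorthand classes above, the following sketch assumes Python's re module; the sample sentence is invented. Note how the underscore counts as a word character, so \b finds no boundary between "bay" and "_7".

```python
import re

text = "Order 66 shipped to bay_7 at 10:45"

digits = re.findall(r"\d+", text)           # runs of digits
words = re.findall(r"\w+", text)            # runs of letters, digits, underscore
tokens = re.findall(r"\S+", text)           # runs of non-whitespace characters
whole_word = re.findall(r"\bbay\b", text)   # "bay" only as a complete word

print(digits)      # ['66', '7', '10', '45']
print(words)       # ['Order', '66', 'shipped', 'to', 'bay_7', 'at', '10', '45']
print(tokens)      # ['Order', '66', 'shipped', 'to', 'bay_7', 'at', '10:45']
print(whole_word)  # []

# Inside a class, \b stands for the backspace character (0x08).
print(re.fullmatch(r"[\b]", "\x08") is not None)  # True
```
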

  • \
    The backslash escapes any character and can therefore be used to force characters to be matched as literals instead of being treated as characters with special meaning. For example, '\[' matches '[' and '\\' matches '\'.
  • .
    A dot matches any single character except a newline. For example, 'go.d' matches 'gold' and 'good'.
  • { }
    {n} ... Match exactly n times
    {n,} ... Match at least n times
    {n,m} ... Match at least n but not more than m times
  • [ ]
    A string enclosed in square brackets matches any single character in that string, but no others. For example, '[xyz]' matches only 'x', 'y', or 'z'. A range of characters may be specified by two characters separated by '-'. Note that '[a-z]' matches lowercase alphabetic characters, while a reversed range such as '[z-a]' never matches (many engines reject it as an error).
  • [-]
    A hyphen within the brackets signifies a range of characters. For example, [b-o] matches any character from b through o.
  • |
    A vertical bar matches the expression on either side of it. For example, bar|car will match either bar or car.
  • *
    An asterisk after a character or group matches any number of occurrences of it, including zero. For example, bo* matches b, bo, boo and booo.
  • +
    A plus sign after a character or group matches one or more occurrences of it. For example, bo+ matches bo, boo and booo, but not b or be.
  • \d+
    matches all numbers with one or more digits
  • \d*
    matches zero or more digits, so it also matches the empty string
  • \w+
    matches all words with one or more characters containing a-z, A-Z, 0-9 and the underscore. \w+ will find title, border, width etc. Please note that in many engines \w matches only ASCII characters (ordinal value lower than 128), while Unicode-aware engines also match accented letters.
  • [a-zA-Z\xA1-\xFF]+
    matches all words with one or more characters containing a-z, A-Z and characters with ordinal value 161 to 255 (e.g. ä or Ü). If you want to find words with numbers, then add 0-9 to the expression: [0-9a-zA-Z\xA1-\xFF]+
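
The list above can be checked with a small sketch, again assuming Python's re module. The helper name full and the sample strings are made up for illustration. In Python 3, \w is Unicode-aware by default; the re.ASCII flag restricts it to a-z, A-Z, 0-9 and the underscore.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

candidates = ["b", "bo", "boo", "be"]

# * allows zero occurrences of the preceding "o", + needs at least one.
star = [s for s in candidates if full(r"bo*", s)]
plus = [s for s in candidates if full(r"bo+", s)]
print(star)  # ['b', 'bo', 'boo']
print(plus)  # ['bo', 'boo']

# A range inside brackets, and alternation with the vertical bar.
print(full(r"[b-o]", "k"), full(r"[b-o]", "p"))          # True False
print(full(r"bar|car", "car"), full(r"bar|car", "far"))  # True False

# Words made of ASCII letters plus the characters \xA1 to \xFF.
sample = "Grüße aus Köln 2010"
print(re.findall(r"[a-zA-Z\xA1-\xFF]+", sample))  # ['Grüße', 'aus', 'Köln']

# With re.ASCII, \w no longer matches the accented letter.
print(re.findall(r"\w+", "Köln", re.ASCII))       # ['K', 'ln']
```
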



Typical examples

  • (bo*)
    will find "b", "bo", "boo", and the "bo" in "bot"
  • (bx+)
    will find "bxxxxxxxx", "bxx", "bx", but not "b" or "be"
  • (\d+)
    will find every run of one or more digits
  • (\d+ visitors)
    will find "3 visitors" or "243234 visitors" or "2763816 visitors"
  • (\d+ of \d+ messages)
    will find "2 of 1200 messages" or "1 of 10 messages"
  • (\d+ of \d+ messages)
    will filter everything from the last occurrence of "2 of 1200 messages" or "1 of 10 messages" to the end of the page
  • (MyText.{0,20})
    will find "MyText" and up to 20 characters after "MyText" (fewer if the line ends sooner)
  • (\d\d.\d\d.\d\d\d\d)
    will find date-strings with format 99.99.9999 or 99-99-9999 (the dot in the regex matches any character)
  • (\d\d\.\d\d\.\d\d\d\d)
    will find date-strings with format 99.99.9999
  • (([_a-zA-Z\d\-\.]+@[_a-zA-Z\d\-]+(\.[_a-zA-Z\d\-]+)+))
    will find most common e-mail addresses (it does not cover every valid address form)
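
A few of these patterns can be run against a sample string, sketched here with Python's re module. The page text and variable names are invented for illustration. One assumption to note: the inner group of the e-mail pattern is written as a non-capturing group (?:...) and the outer parentheses are dropped, because re.findall would otherwise return the groups instead of the whole match.

```python
import re

# Invented sample text for illustration.
page = ("Today: 3 visitors, 2 of 1200 messages, "
        "updated 08.10.2010 and 08-10-2010. Contact: jane.doe@example.com")

visitors = re.findall(r"\d+ visitors", page)
messages = re.findall(r"\d+ of \d+ messages", page)

# Unescaped dots accept any separator; escaped dots accept only a literal dot.
loose_dates = re.findall(r"\d\d.\d\d.\d\d\d\d", page)
strict_dates = re.findall(r"\d\d\.\d\d\.\d\d\d\d", page)

EMAIL = r"[_a-zA-Z\d\-\.]+@[_a-zA-Z\d\-]+(?:\.[_a-zA-Z\d\-]+)+"
emails = re.findall(EMAIL, page)

# The keyword plus at most the next 20 characters.
snippet = re.search(r"MyText.{0,20}", "MyText is followed by more text here").group()

print(visitors)      # ['3 visitors']
print(messages)      # ['2 of 1200 messages']
print(loose_dates)   # ['08.10.2010', '08-10-2010']
print(strict_dates)  # ['08.10.2010']
print(emails)        # ['jane.doe@example.com']
print(snippet)       # MyText is followed by more
```
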
