
Introduction to Regular Expressions
Matt Casto
Quick Solutions
http://google.com/profiles/mattcasto

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."
- Jamie Zawinski, August 12, 1997

What are Regular Expressions?

^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$
[\w-]+@([\w-]+\.)+[\w-]+
^.+@[^\.].*\.[a-z]{2,}$

^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$

History
Stephen Cole Kleene
American mathematician credited with inventing regular expressions in the 1950s, using a mathematical notation called regular sets.

History
Ken Thompson
American pioneer of computer science who, among many other things, used Kleene's regular sets for searching in his QED and ed text editors.

History
grep
Global Regular Expression Print

History
Henry Spencer
Wrote the regex library that the Perl and Tcl languages used for regular expressions.

Why Should You Care?


Example: finding duplicate words in a file.

Requirements:
Output lines that contain duplicate words
Find doubled words that span lines
Ignore capitalization differences
Ignore HTML tags

Why Should You Care?


Example: finding duplicate words in a file.
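Solution (Perl):
$/ = ".\n";    # read the input in chunks that end with a period and a newline
while (<>) {
    next if !s/\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)/\e[7m$1\e[m$2\e[7m$3\e[m/ig;
    s/^(?:[^\e]*\n)+//mg;    # remove lines that contain no highlighted match
    s/^/$ARGV: /mg;          # prefix each remaining line with the file name
    print;
}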
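
The same idea can be sketched in Python. This is only an illustration under assumptions: it reuses the pattern from the Perl script with Python's built-in re module, and the function name find_doubled_words and the sample text are made up for this example. It reports the doubled words instead of highlighting them.

```python
import re

# A word, then whitespace and/or HTML tags, then the same word again.
# re.IGNORECASE makes both the word and the back reference case-insensitive.
DOUBLED = re.compile(r"\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)", re.IGNORECASE)

def find_doubled_words(text):
    """Return the first word of every doubled pair found in text."""
    return [m.group(1) for m in DOUBLED.finditer(text)]

# Hypothetical sample: one pair spans a line break, one is split by a tag.
sample = "Paris in the\nthe spring. It is <b>very</b> very nice."
doubled = find_doubled_words(sample)
print(doubled)  # ['the', 'very']
```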

Literal Characters
Any character except a small list of reserved characters.

regex: is
target string: Jack is a boy
result: one match, the word "is"

Literal Characters
Literals will match characters in the middle of words.

regex: a
target string: Jack is a boy
result: two matches, the "a" in "Jack" and the word "a"

Literal Characters
Literals are case sensitive: capitalization matters!

regex: j
target string: Jack is a boy
result: NOT a match
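
The three literal examples can be checked with a short Python sketch. It assumes Python's built-in re module, which the slides do not name; the variable names are illustrative only.

```python
import re

target = "Jack is a boy"

is_matches = re.findall(r"is", target)  # the word "is"
a_matches = re.findall(r"a", target)    # the "a" in "Jack" and the word "a"
j_match = re.search(r"j", target)       # a lowercase j never occurs

print(is_matches)  # ['is']
print(a_matches)   # ['a', 'a']
print(j_match)     # None

# Case sensitivity can be switched off with a flag.
j_ignoring_case = re.search(r"j", target, re.IGNORECASE)
print(j_ignoring_case.group(0))  # J
```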

Special Characters

[ \ ^ $ . | ? * + ( )

Special Characters
You can match special characters by escaping
them with a backslash.

1\+1=2
I wrote 1+1=2 on the chalkboard.
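
A small Python sketch (assuming the built-in re module) shows why the backslash matters here. Without it, the plus sign is a repetition operator and the pattern no longer matches the sentence.

```python
import re

target = "I wrote 1+1=2 on the chalkboard."

escaped = re.search(r"1\+1=2", target)   # \+ is a literal plus sign
unescaped = re.search(r"1+1=2", target)  # + means "one or more 1s"

print(escaped.group(0))  # 1+1=2
print(unescaped)         # None

# re.escape builds an escaped pattern from literal text.
from_literal = re.search(re.escape("1+1=2"), target)
print(from_literal is not None)  # True
```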

Special Characters
Some characters, such as { and }, are reserved only in certain contexts.

if \(true\) \{
else if (true) { beep; }

Non-Printable Characters
Some literal characters can be escaped to
represent non-printable characters.
\t tab
\r carriage return
\n line feed
\a bell
\e escape
\f form feed
\v vertical tab

Period
The period character matches any single character (in most engines, any character except a line break).

a.boy
Jack is a boy
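
As a quick check, here is a Python sketch of the period example, assuming the built-in re module. In that module the period does not match a line feed unless the re.DOTALL flag is set.

```python
import re

dot_match = re.search(r"a.boy", "Jack is a boy")
print(dot_match.group(0))  # a boy

# The period skips line feeds by default; re.DOTALL lifts that limit.
no_match = re.search(r"a.boy", "a\nboy")
dotall_match = re.search(r"a.boy", "a\nboy", re.DOTALL)
print(no_match)                   # None
print(dotall_match is not None)   # True
```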

Character Classes
Used to match only one of the characters inside square brackets.

[Gg]r[ae]y
Grayson drives a grey sedan.

Character Classes
Hyphen is a reserved character inside a
character class and indicates a range.

[0-9a-fA-F]
The HTML code for White is #FFFFFF

Character Classes
Caret inside a character class negates the
match.

q[^u]
Qatar is home to quite a lot of Iraqi
citizens, but is not a city in Iraq

Character Classes
Most special characters are ordinary literals inside a character class. Only ] \ ^ and - are reserved.
[+*]
6 * 7 and 18 + 24 both equal 42
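
The character class examples above can be tried in Python. This is a sketch assuming the built-in re module; the hex example uses just the color code as its target so the result stays short.

```python
import re

gray = re.findall(r"[Gg]r[ae]y", "Grayson drives a grey sedan.")
print(gray)  # ['Gray', 'grey']

hex_digits = re.findall(r"[0-9a-fA-F]", "#FFFFFF")
print(hex_digits)  # ['F', 'F', 'F', 'F', 'F', 'F']

text = "Qatar is home to quite a lot of Iraqi citizens, but is not a city in Iraq"
# [^u] must still consume one character, so the final "q" in "Iraq" is skipped.
q_not_u = re.findall(r"q[^u]", text)
print(q_not_u)  # ['qi']

# + and * are plain literals inside a character class.
operators = re.findall(r"[+*]", "6 * 7 and 18 + 24 both equal 42")
print(operators)  # ['*', '+']
```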

Shorthand Character Classes


\d digit or [0-9]
\w word or [A-Za-z0-9_]
\s whitespace or [ \t\r\n] (space, tab, CR, LF)
[\s\d]
1+2=3

Shorthand Character Classes


\D non-digit or [^\d]
\W non-word or [^\w]
\S non-whitespace or [^\s]
[\D]
1+2=3
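
A Python sketch of the shorthand classes, assuming the built-in re module. One caveat: for Python text strings \d, \w and \s also match non-ASCII characters by default, so the bracketed equivalents on the slides hold exactly only with the re.ASCII flag.

```python
import re

target = "1+2=3"

digits_or_space = re.findall(r"[\s\d]", target)
non_digits = re.findall(r"[\D]", target)

print(digits_or_space)  # ['1', '2', '3']
print(non_digits)       # ['+', '=']

# With re.ASCII, \w is exactly [A-Za-z0-9_].
words = re.findall(r"\w+", "Jack is a boy", re.ASCII)
print(words)  # ['Jack', 'is', 'a', 'boy']
```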

Repetition
The asterisk repeats the preceding token (here a character class) 0 or more times.

<[A-Za-z][A-Za-z0-9]*>
<HTML>Regex is <b>Awesome</b></HTML>

Repetition
The plus repeats the preceding token 1 or more times.

<[A-Za-z0-9]+>
Watch out for invalid <HTML> tags like <1>
and <>!

Repetition
The question mark repeats the preceding token 0 or 1 times, in effect making it optional.

</?[A-Za-z][A-Za-z0-9]*>
<HTML>Regex is <b>Awesome</b></HTML>
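
The three repetition operators can be compared side by side in Python. This is a sketch assuming the built-in re module, with the target strings taken from the slides.

```python
import re

html = "<HTML>Regex is <b>Awesome</b></HTML>"
sloppy = "Watch out for invalid <HTML> tags like <1> and <>!"

# * allows zero or more extra characters after the first letter.
star = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", html)
print(star)  # ['<HTML>', '<b>']

# + needs at least one character, so <1> matches but <> does not.
plus = re.findall(r"<[A-Za-z0-9]+>", sloppy)
print(plus)  # ['<HTML>', '<1>']

# ? makes the slash optional, so closing tags match as well.
optional = re.findall(r"</?[A-Za-z][A-Za-z0-9]*>", html)
print(optional)  # ['<HTML>', '<b>', '</b>', '</HTML>']
```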

Anchors
The caret anchor matches the position before the
first character in a string.

^vac
vacation evacuation

Anchors
The dollar sign anchor matches the position after
the last character in a string.

tion$
vacation evacuation

Anchors
The caret and dollar sign anchors match the start and end of each line if the engine has multi-line mode turned on.
tion$
vacation evacuation
has ruined my evaluation

Anchors
The \A and \Z anchors are like ^ and $, but they match only at the start and end of the whole string, even in multi-line mode.
tion\Z
vacation evacuation
has ruined my evaluation
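
The anchor examples can be sketched in Python, assuming the built-in re module, where multi-line mode is the re.MULTILINE flag. Other engines differ in detail; in some, \Z also matches before a final line break.

```python
import re

one_line = "vacation evacuation"
two_lines = "vacation evacuation\nhas ruined my evaluation"

starts = [m.start() for m in re.finditer(r"^vac", one_line)]
ends = [m.start() for m in re.finditer(r"tion$", one_line)]
print(starts)  # [0]
print(ends)    # [15]

# Without multi-line mode, $ only matches at the end of the string.
default_count = len(re.findall(r"tion$", two_lines))
# With multi-line mode, $ also matches at the end of each line.
multiline_count = len(re.findall(r"tion$", two_lines, re.MULTILINE))
# \Z ignores multi-line mode and matches only at the very end.
end_only_count = len(re.findall(r"tion\Z", two_lines, re.MULTILINE))

print(default_count, multiline_count, end_only_count)  # 1 2 1
```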

Word Boundaries
The \b anchor matches at a word boundary:
before the first character in a string, if it is a word character (like ^)
after the last character in a string, if it is a word character (like $)
between two characters where one is a word character and the other is not
\b4\b
We've got 4 orders for 44 lbs of C4

Word Boundaries
The \B anchor is the negated word boundary: any position between two word characters or between two non-word characters.
\Bat\B
vacation evacuation at that
time ate my evaluation
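
Both boundary examples can be verified with a Python sketch, assuming the built-in re module. The start offsets show that only the standalone 4 matches.

```python
import re

orders = "We've got 4 orders for 44 lbs of C4"
# Neither digit of "44" nor the "4" in "C4" has a boundary on both sides.
standalone_four = [m.start() for m in re.finditer(r"\b4\b", orders)]
print(standalone_four)  # [10]

words = "vacation evacuation at that time ate my evaluation"
# \Bat\B needs word characters on both sides of "at", so the standalone
# "at", the end of "that" and the start of "ate" are all skipped.
inner_at = re.findall(r"\Bat\B", words)
print(len(inner_at))  # 3
```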

Alternation
The pipe symbol separates two or more alternative expressions, any one of which can match.

cat|dog
A cat and dog are expected to follow
the dogma that their presence with one
another leads to catastrophe.

Alternation
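Alternation has the lowest precedence of all operators, so each alternative includes everything on its side of the pipe: this pattern means \bcat or dog\b.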

\bcat|dog\b
A cat and dog are expected to follow
the dogma that their presence with one
another leads to catastrophe.

Alternation
Use parentheses to group the alternatives when you want to limit the reach of alternation.

\b(cat|dog)\b
A cat and dog are expected to follow
the dogma that their presence with one
another leads to catastrophe.
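
The three alternation patterns give different results on the same sentence. The Python sketch below assumes the built-in re module; the helper all_matches is a name invented for this example.

```python
import re

text = ("A cat and dog are expected to follow the dogma that their "
        "presence with one another leads to catastrophe.")

def all_matches(pattern):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]

plain = all_matches(r"cat|dog")
print(plain)  # ['cat', 'dog', 'dog', 'cat']

# Read as (\bcat) or (dog\b): "dogma" drops out, "catastrophe" stays.
ungrouped = all_matches(r"\bcat|dog\b")
print(ungrouped)  # ['cat', 'dog', 'cat']

# The group lets both boundaries apply to both alternatives.
grouped = all_matches(r"\b(cat|dog)\b")
print(grouped)  # ['cat', 'dog']
```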

Eagerness
The engine is eager: it takes the first alternative that matches, so the order of the alternatives matters. Here "android" is never matched in full.

and|android
A robot and an android fight. The ninja wins.
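
A Python sketch of the same effect, assuming the built-in re module. Swapping the alternatives (the second pattern is not on the slide) lets the longer word win.

```python
import re

robots = "A robot and an android fight. The ninja wins."

# "and" is tried first, so only the start of "android" is matched.
short_first = re.findall(r"and|android", robots)
print(short_first)  # ['and', 'and']

# Putting the longer alternative first matches the whole word.
long_first = re.findall(r"android|and", robots)
print(long_first)  # ['and', 'android']
```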

Greediness
Greediness means that the engine will always try
to match as much as possible.

an\S+
A robot and an android fight. The ninja wins.

Laziness
Laziness, or reluctance, modifies a repetition operator so that it matches only as much as it needs to.

an\S+?
A robot and an android fight. The ninja wins.
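
Greedy and lazy repetition can be compared directly in Python. This sketch assumes the built-in re module and uses the sentence from the slides.

```python
import re

robots = "A robot and an android fight. The ninja wins."

# Greedy: \S+ takes every non-whitespace character it can.
greedy = re.findall(r"an\S+", robots)
print(greedy)  # ['and', 'android']

# Lazy: \S+? stops after the single character it is required to match.
lazy = re.findall(r"an\S+?", robots)
print(lazy)  # ['and', 'and']
```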

Limiting Repetition
You can limit repetition with curly braces.

\d{2,4}
1 11 111 1111 11111

Limiting Repetition
The second number can be omitted to mean no upper limit.
Essentially, {0,} is the same as * and {1,} is the same as +.
\d{2,}
1 11 111 1111 11111

Limiting Repetition
A single number can be used to match an exact number of times.

\d{4}
1 11 111 1111 11111
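
The three forms of limited repetition, tried in Python on the target from the slides. This is a sketch assuming the built-in re module. Note that a bounded pattern can match part of a longer run of digits.

```python
import re

ones = "1 11 111 1111 11111"

between_two_and_four = re.findall(r"\d{2,4}", ones)
print(between_two_and_four)  # ['11', '111', '1111', '1111']

two_or_more = re.findall(r"\d{2,}", ones)
print(two_or_more)  # ['11', '111', '1111', '11111']

exactly_four = re.findall(r"\d{4}", ones)
print(exactly_four)  # ['1111', '1111']
```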

Back References
Parentheses around part of a pattern group it and capture the text it matched. A back reference such as \1 then matches that same text again.

([ai]).\1.\1
The magician said abracadabra!
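
A Python sketch of the back reference example, assuming the built-in re module. The only place where the same vowel appears three times with one character in between is inside "abracadabra".

```python
import re

spell = "The magician said abracadabra!"

backref = re.search(r"([ai]).\1.\1", spell)
print(backref.group(0))  # acada
print(backref.group(1))  # a
```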

Named Groups
Named groups let you reference matched groups
by their name rather than just index.

(?<vowel>[ai]).\k<vowel>.\1
The magician said abracadabra!
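
Named group syntax differs between engines. The pattern above uses the (?<name>...) and \k<name> forms, while Python's built-in re module, assumed in this sketch, writes them as (?P<name>...) and (?P=name). A named group still keeps its index, so \1 works too.

```python
import re

spell = "The magician said abracadabra!"

named = re.search(r"(?P<vowel>[ai]).(?P=vowel).\1", spell)
print(named.group(0))        # acada
print(named.group("vowel"))  # a
print(named.group(1))        # a
```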

Negative Lookahead
A negative lookahead succeeds only where the given pattern does not follow, without consuming any characters.

q(?!u)
Qatar is home to quite a lot of Iraqi
citizens, but is not a city in Iraq

Positive Lookahead
Positive lookaheads match something that is
there without having that group included in
the match.
q(?=u)
Qatar is home to quite a lot of Iraqi
citizens, but is not a city in Iraq
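
The two lookaheads can be compared with the negated character class q[^u] in Python. This sketch assumes the built-in re module. A lookahead checks what follows without making it part of the match, so it also succeeds at the end of the string.

```python
import re

text = "Qatar is home to quite a lot of Iraqi citizens, but is not a city in Iraq"

# Matches the q in "Iraqi" and the final q in "Iraq".
not_followed_by_u = re.findall(r"q(?!u)", text)
print(not_followed_by_u)  # ['q', 'q']

# Matches only the q in "quite"; the u is not part of the match.
followed_by_u = re.findall(r"q(?=u)", text)
print(followed_by_u)  # ['q']

# The character class consumes a character and needs one to be present.
class_version = re.findall(r"q[^u]", text)
print(class_version)  # ['qi']
```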

Positive & Negative Lookbehind
Lookbehinds are just like lookaheads, but
working backwards.

(?<=a)q
Qatar is home to quite a lot of Iraqi
citizens, but is not a city in Iraq
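
Both kinds of lookbehind in a Python sketch, assuming the built-in re module. The negative form (?<!a)q is not shown on the slide and is added here for comparison.

```python
import re

text = "Qatar is home to quite a lot of Iraqi citizens, but is not a city in Iraq"

# Positive lookbehind: a q preceded by an a ("Iraqi" and "Iraq").
after_a = [m.start() for m in re.finditer(r"(?<=a)q", text)]
print(len(after_a))  # 2

# Negative lookbehind: a q not preceded by an a (only "quite").
not_after_a = [m.start() for m in re.finditer(r"(?<!a)q", text)]
print(len(not_after_a))  # 1
```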

Resources
Lots of web pages
http://del.icio.us/mattcasto/regex

Mastering Regular Expressions by Jeffrey Friedl
http://oreilly.com/catalog/9780596528126/
