This is a compilation of three articles from "Today on The World" about the subject of regular expressions. The "official" documentation for the core set of regular expression features is in:

    world% man ed

The man pages for sed, grep, and egrep describe additional features those programs support (and features from the "core set" that aren't supported).

TIP: Regular expressions (part 1)

We've alluded many times to "regular expressions" in "Today on The World", but have never attempted to systematically explain them. We'll be running a number of articles about regular expressions with examples of how they are used.

* What are regular expressions, and why should I care about them?

Many programs that were developed under UNIX use regular expressions. A "regular expression" is a pattern which will match certain text. Programs that use regular expressions search through files, or databases, or editor buffers, or standard input, or just about anything else (constellations, jewelry cabinets), and do something with the text that matches the regular expression -- print it out, alter it in some way and then print it out, mail it to someone. It all depends on the program. This general description will be fleshed out with examples in the articles to come.

The "classic" UNIX programs that use regular expressions, "grep" and "sed," have been ported to PCs, Amigas, and, I assume, Macs; as have programming languages that make heavy use of regular expressions, such as "awk" and "perl".

Yesterday's article about searching /usr/dict/words showed the best known of the programs that use regular expressions, "grep". "grep" stands for "Global Regular Expression Print". As mentioned yesterday, the file "words" contains a list of words, one per line. We showed a trivial example:

    world% grep portion /usr/dict/words

This looks for the pattern "portion" in the "words" file, and prints each line where that pattern is found.
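
For readers who would like to experiment away from a UNIX shell, here is a minimal sketch of what the trivial grep example does, written in Python. The Python "re" module is our assumption here, not one of the tools these articles discuss, and the short word list is a hypothetical stand-in for /usr/dict/words.

```python
import re


def grep(pattern, lines):
    """Return every line that contains a match for pattern, grep-style."""
    regex = re.compile(pattern)
    # search() looks for the pattern anywhere in the line, as grep does.
    return [line for line in lines if regex.search(line)]


# Hypothetical sample data standing in for /usr/dict/words (one word per line).
words = ["apportion", "portion", "portrait", "proportion", "ration"]

for line in grep("portion", words):
    print(line)
# prints: apportion, portion, proportion (one per line)
```

Note that "apportion" and "proportion" are printed as well: like grep, the sketch reports any line in which the pattern occurs, not only lines that consist of the pattern alone.
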
Regular expressions can find patterns that are much more complicated. For example, if you want to find any line which starts with a "D", has a hyphen in the 50th column and is followed by a single space and then a digit, "grep" can do that. What I had in mind when I made up that example was an ASCII dump of a billing database, where every line that begins with "D" represents a disbursement, and the dollar amount of each transaction starts in the 50th column, and you are looking for large negative numbers.

My point is that anyone who works with computers might find regular expressions useful; even if you don't see immediate applications of regular expressions in your internet work (or play), you should know they exist: there is a way to find the information you want in any textual data. Keep that in mind when wading through the morass that is the internet; knowing a little (or a lot) about regular expressions, you can find more of what you're looking for.

The rules for regular expressions are not terribly complicated. But they tend to use a lot of characters that are special to the UNIX shell. "Escaping" or "quoting" those special characters can get complex; you have to escape the characters to prevent the UNIX shell from altering them before passing them to the program you are running. (See "Today on The World" 1.076 and 1.175.)

* Example 1: finding text with 2 or more substrings

For our first example of using regular expressions, let's consider an email archie search. Using email avoids the complications that accompany running the regular expression through the UNIX shell. (Archie is a program that allows you to search for files stored on ftp servers around the Internet. For general information about searching archie through email, see "Today on The World" 1.091.)

We'll be looking at a common class of regular expressions, the case where you're looking for text that contains two known substrings in known order, but with unknown intervening text.
Thinking back to yesterday's example, "how do you spell 'bourgeois'": if all you knew was that it began with "b" and had a "urg" in it somewhere, that's similar to the type of thing we'll be looking at today. But we'll stick to archie searches for now.

Let's say you have a program called "GS261EXE.ZIP" and you wonder if there is a newer version. An archie server can check the ftp servers on the net for you and let you know. The problem is that you don't know the name of the file you're searching for. The file you want most likely begins with GS, followed by some text (you don't know how much or what it is), followed by EXE. (The unknown text is probably all numbers, but you never know, there might be "GS401AEXE.ZIP".) So, you could send a search with:

    set search regex
    find GS.*EXE
    end

The .* (usually pronounced "dot star") notation is one of the most common patterns in regular expressions. The dot is a regular expression that matches any character, and the star, following a one character regular expression like ".", means "any number of that regular expression."

So, the pattern above would find files named "GS345EXE". It would also find, if they existed, a lot of files with names unlike what you had in mind; when I ran the search, it actually found, among other things:

    DLGS.EXE      (.* matched the period)
    JIGSAW.EXE    (.* matched "AW.")

Although it didn't come up in this case, if there's a file out there called "GSEXECUTE.ARJ", this pattern will find it. The star means "any number of the previous character, including zero".

There are a number of ways the search can be refined. We could have it look for files that begin with "GS" (and that would probably actually solve the problem adequately in real life). We could have it look for GS followed by any number of characters other than "." followed by EXE. The example I'm going to use, in fact, doesn't work as well as either of those methods (or some others) would, but I'm choosing it to illustrate the use of "." and "*" further. I searched for:

    GS.*EXE..*

What is this saying? In "..*" the first dot matches any single character. Since it's not followed by a "*", it matches exactly one character; for a file's name to match the pattern, there has to be a character following "EXE". Then we have ".*" again, which matches any string of characters, including the empty string (zero characters); we'll see anything where there's at least one character, but maybe no more, after EXE.

The point is that we are weeding out all those file names that end with ".EXE": since they don't have any characters after "EXE", they don't match the pattern. (Readers already familiar with archie searches might note that the second ".*" is superfluous; that's true, since dot-star is implicit at the beginning and end of archie searches.)

That's enough for today's example. The ".*" pattern is seen very often in regular expressions. "..*" is actually not very common, in my experience. But some closely related patterns do come up: "..*" means "at least one character", and often you do want something like "at least one digit", and you use a similar syntax to express that.

If you try this search, it actually won't find anything; it turns out the files I was thinking of are in lower case, but that's also something we'll deal with in future issues.

____________________________________________________________

TIP: Regular expressions (part 2)

(Part 1 of this article appeared in "Today on The World" 1.205.)

The first article in this series explained the dot ("."), which is a regular expression standing for any single character, and the star ("*"), which signifies zero or more occurrences of the previous single character regular expression. This article will look at sets of characters (and inversions of those sets).

If you are looking for a pattern that might contain one of a number of characters, you can put those characters in square brackets.
For example, say you were looking at a mail folder with "more", and thought you remembered getting mail from someone named "bill" or "william". (Okay, you might not usually use "more" for this, but it's for the sake of example.) More lets you search for regular expressions with "/". You want to find something that looks like one of:

    From: bill....
    From: Bill....
    From: will....
    From: Will....

You can search for that with the regular expression:

    From: [bwWB]ill

That will match any line that contains any of the four patterns above. Since email addresses typically don't really look like that, but instead something like:

    From: wwr@sun4.state.edu (William Sample)

you would probably really want to use the dot star notation to say "I don't know what comes immediately after 'From: ', but find anything with 'From: ' followed anywhere on the line by Bill or Will or . . ." The regular expression for that:

    From: .*[bwWB]ill

Actually, in my experience, using lists of characters like this doesn't come up often, but you do frequently want to search somewhere where you know that a character will be in a certain range; typically, you want to specify "any letter", or "any digit". Inside square brackets, you can denote ranges with a hyphen. For example, [0-9] matches any single digit.

As an example, in the email archie search we gave last time, the final example of a pattern to find files with names like "GS261EXE.ZIP" was:

    GS.*EXE.

Meaning, any file with "GS" followed by anything, followed by EXE, followed by at least one other character. In practice, this would work okay, but in theory, it would find files with names like "EGGS-AND-EXECUTIVES". If we were to look for:

    GS[0-9].*EXE.

That pattern would tell archie to find files with GS, followed by a digit, followed by any text, followed by EXE, followed by at least one character.
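
The two bracket patterns above can be tried out with a short sketch like the following. It uses Python's "re" module, which is our assumption and not a tool the articles cover (for these particular patterns it should behave the same way as more or archie). The example.com addresses and the list of file names are hypothetical test data; only the wwr@sun4.state.edu line and the file names quoted in the articles come from the text.

```python
import re

# A character list: "From: ", anything, then bill, Bill, will or Will.
from_pattern = re.compile(r"From: .*[bwWB]ill")

# A range: GS, one digit, anything, EXE, then at least one more character.
file_pattern = re.compile(r"GS[0-9].*EXE.")

# Hypothetical mail headers, apart from the first, which is from the article.
headers = [
    "From: wwr@sun4.state.edu (William Sample)",
    "From: bill@example.com",
    "From: jane@example.com (Jane Doe)",
]

# File names mentioned in the articles, used here as sample input.
names = ["GS261EXE.ZIP", "GS401AEXE.ZIP", "EGGS-AND-EXECUTIVES", "DLGS.EXE"]

matching_headers = [h for h in headers if from_pattern.search(h)]
matching_names = [n for n in names if file_pattern.search(n)]

print(matching_headers)  # the first two headers
print(matching_names)    # ['GS261EXE.ZIP', 'GS401AEXE.ZIP']
```

Requiring a digit right after "GS" is what rules out "EGGS-AND-EXECUTIVES" and "DLGS.EXE", both of which the looser pattern "GS.*EXE" would have accepted.
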
As another example, say you are looking for any file edited on the 12th, 13th, or 14th of July. You can run:

    world% ls -l | more

And, in more, type:

    /Jul 1[2-4]

(Usually, you'd probably just pipe ls into "grep" in real life, but for now we want to separate regular expressions from the issues of quoting introduced by the shell.)

Patterns in brackets can also be "inverted". If the first character inside square brackets is a caret ("^"), the bracketed material is a regular expression matching any character except those in the brackets. For example, in the example above, if we wanted to find any file that was edited on the 10th, 11th, 15th, 16th, 17th, 18th, or 19th of July, this would probably do it:

    /Jul 1[^2-4]

A more realistic example is another archie search. A customer asked recently if there was a vi for dos; I had archie search for:

    vi[^a-z].*exe

and

    vi[^a-z].*zip

Meaning, search for any file with "vi" in the name followed by a non-letter character, followed by anything, followed by "exe" (or "zip").

Ranges, inverted ranges, "dot" and "star" are enough to write most regular expressions. There are a few more important metacharacters which most programs that work with regular expressions use. We'll cover those in the next article about regular expressions, and start looking at using regular expressions at the shell with the most commonly used of those programs, grep.

____________________________________________________________

TIP: Regular Expressions, part 3

The first and second articles about using regular expressions appeared in "Today on The World," 1.205 and 1.213, respectively.

We've covered most of the core of regular expressions; there are just a few special symbols to look at now.

    ^   represents the beginning of a line,
    $   the end of a line,
    \   "escapes" special symbols, making them represent themselves.
Actually that wording about beginning and ending is inaccurate, though it ends up being intuitively clear: ^ and $ have the significance mentioned only at the beginning and end of a regular expression, respectively. That is to say:

    .*^

just matches any line with a caret in it; while:

    $.*

would match any line with a dollar sign.

In the first of this series of articles we mentioned that you could look for complex patterns, with the example: "any line which starts with a "D", has a hyphen in the 50th column and is followed by a single space and then a digit." Now that we have the caret, we can show a regular expression that finds that (let's change that 50th column to the 10th, though, so careful proofreaders won't hurt their eyes):

    ^D........- [0-9]

We also said that grep would be able to look through a given file to find the pattern, but we didn't show grep because we wanted to avoid the complications of typing regular expressions to the shell. The example above would present two problems: grep would think the space was the end of the regular expression, and that [0-9] was supposed to be a filename. The remedy to this, in the c shell, is to quote the regular expression in single quotes. For example:

    world% grep '^D........- [0-9]' filename

would do the trick.

Now, let's take some other examples. If you were interested in finding the earlier issues of "Today on The World" about regular expressions:

    world% grep '^TIP.*[Rr]eg.*art [12])$' /help/yesterday/*

would find any line in any of the files in /help/yesterday that:

    ^TIP      begins with TIP
    .*        followed by anything (or nothing)
    [Rr]eg    followed by Reg or reg
    .*        followed by anything
    art       followed by "art" (as in Part or part)
    [12]      followed by a one or two
    )$        followed by a ) with the paren being the last
              character in the line.

Say you had a file with sections marked like:

    1. Overview
    1.1 Before the Diamond
    ....
    10. The reign of the Red Sox
    10.1 Falling Rocket, rising stars

That is, section headings begin sometimes with some spaces, sometimes without them, followed by one or more digits, followed by a period, and possibly a number, then some spaces, then a capital letter. The following command could find all the lines that were section headings:

    world% grep '^ *[0-9][0-9]*\.[0-9]* *[A-Z]' baseball

The '[0-9][0-9]*' part means "one or more digits". The \. tells grep that we really mean a "dot" has to come after those digits, not just any character (what "." usually means). (The command is not guaranteed not to find non-header lines, but it's likely not to.)

Those examples are a bit complex (but in real life you sometimes need much more complex ones); something a bit similar to the one above came up for me in trying to sort through some mail. In the mail program I use, you can hit "|" to pipe the current message to a shell command. I was reading lists of books for sale, too long to read through, but each entry at least had the author's name in the first line of the description, and began with a number. So, I could do:

    |grep '^[0-9][0-9]*'

Meaning "print any line that starts with one or more digits". It actually produced a bit more trash than I wanted, but each number was followed by a period, so:

    |grep '^[0-9][0-9]*\.'

did the trick.

The special character "?" is like "*", except that it means "0 or 1" of the preceding single character regular expression. For example, to find any line in a mail spool that begins with either "From " or "From: ", the regular expression would be:

    "^From:? "

(quotes are only to show the space after the question mark; they aren't part of the regular expression per se