Repost note: original link unknown
GAWK Manual
Author:
Wilbur Lang
Chapter 1 Preface
Chapter 2 Introduction
Chapter 3 Reading input files
Chapter 4 Printing
Chapter 5 Patterns
Chapter 6 Expressions as descriptions of Actions
Chapter 7 Control statements inside Actions
Chapter 8 Built-in Functions
Chapter 9 User-defined functions
Chapter 10 Examples
Chapter 11 Conclusion
Chapter 1 Preface
awk is a programming language well suited to data processing. Tasks such as modifying, comparing, and extracting data in text files can be done in awk with very short programs. Writing the same programs in a language such as C or Pascal is more cumbersome and time-consuming, and the resulting programs are considerably longer.
awk can break down input data according to user-defined formats, and can also print data according to user-defined formats.
The name awk comes from the first letters of the surnames of its original designers: Alfred V. Aho, Peter J. Weinberger, Brian W. Kernighan.
awk was first completed in 1977. A new version of awk was released in 1985, and its functions were much enhanced compared with the old version.
gawk is GNU's awk. gawk was first completed in 1986, and has since been continuously improved and updated. gawk includes all the functions of awk.
The following gawk examples will use the 2 input files below for illustration.
File 'BBS-list':
aardvark 555-5553 1200/300 B
alpo-net 555-3412 2400/1200/300 A
barfly 555-7685 1200/300 A
bites 555-1675 2400/1200/300 A
camelot 555-0542 300 C
core 555-2912 1200/300 C
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
macfoo 555-6480 1200/300 A
sdace 555-3430 2400/1200/300 A
sabafoo 555-2127 1200/300 C
File 'shipped':
Jan 13 25 15 115
Feb 15 32 24 226
Mar 15 24 34 228
Apr 31 52 63 420
May 16 34 29 208
Jun 31 42 75 492
Jul 24 34 67 436
Aug 15 34 47 316
Sep 13 55 37 277
Oct 29 54 68 525
Nov 20 87 82 577
Dec 17 35 61 401
Jan 21 36 64 620
Feb 26 58 80 652
Mar 24 75 70 495
Apr 21 70 74 514
Chapter 2 Introduction
gawk's main function is to search each line of a file for specified patterns.
When a line matches the specified patterns, gawk executes the specified actions on that line. gawk processes every line of the input file this way until the end of the input file.
A gawk program is made up of many patterns and actions. The action is written inside braces { }. A pattern is followed by an action. The whole gawk program looks like this:
pattern {action}
pattern {action}
In a rule of a gawk program, either the pattern or the action may be omitted, but not both. If the pattern is omitted, the action is executed for every line of the input file. If the action is omitted, the default action prints every input line that matches the pattern.
2.1 How to execute a gawk program
Basically, there are 2 ways to execute a gawk program.
□If the gawk program is short, it can be written directly on the command line, as shown below:
gawk 'program' input-file1 input-file2 ...
Here program consists of some patterns and actions.
□If the gawk program is longer, it is more convenient to store it in a file, that is, to write the patterns and actions in a file named program-file. gawk is then run as shown below:
gawk -f program-file input-file1 input-file2 ...
If there is more than one gawk program file, the format for executing gawk is shown below:
gawk -f program-file1 -f program-file2 ... input-file1 input-file2 ...
2.2 A simple example
Now let us look at a simple example. Because the gawk program is short, the gawk program is written directly on the command line.
gawk '/foo/ {print $0}' BBS-list
The actual gawk program is /foo/ {print $0}. /foo/ is the pattern, meaning it searches
every line in the input file to see whether it contains the substring 'foo'. If it contains 'foo', then the action is executed.
The action is print $0, which prints the contents of the current line. BBS-list is the input file.
After executing the above command, the following result will be printed:
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
macfoo 555-6480 1200/300 A
sabafoo 555-2127 1200/300 C
2.3 A more complex example
gawk '$1 == "Feb" {sum=$2+$3} END {print sum}' shipped
This example compares the first field of each line of the input file 'shipped' with "Feb". If they are equal, the sum of the 2nd and 3rd fields is assigned to the variable sum. This is repeated for every line of the input file. END {print sum} means that after all input has been read, the action print sum is executed once. Because sum is overwritten by each matching line rather than accumulated, the value printed comes from the last "Feb" line (26 + 58).
The result is:
84
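To total fields 2 and 3 over every "Feb" record instead, the += operator can be used. The following is a minimal sketch, assuming a POSIX shell and any POSIX awk on the PATH (the command name awk is used here; gawk can be substituted). The records are taken from 'shipped' and fed in-line so the snippet runs on its own; the shell variable name total is only illustrative.

```shell
# Three records from 'shipped', supplied inline instead of from the file.
total=$(printf '%s\n' 'Jan 13 25 15 115' 'Feb 15 32 24 226' 'Feb 26 58 80 652' |
  awk '$1 == "Feb" { sum += $2 + $3 }   # accumulate instead of overwrite
       END         { print sum }')
echo "$total"   # (15+32) + (26+58) = 131
```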
Chapter 3 Reading input files
gawk input can be read from standard input or from specified files. The unit of input is called a "record". gawk processes one record at a time. By default each record is one line, and a record is divided into multiple fields.
3.1 How input is divided into records
The gawk language divides input into records. Records are separated from each other by the
record separator. The default value of the record separator is the newline
character, so the default record separator makes each line of text one record.
The record separator changes when the built-in variable RS is changed. RS is a string, and its default value is "\n". In traditional awk only the first character of RS is used as the record separator and the other characters are ignored; current gawk versions instead treat an RS longer than one character as a regular expression.
The built-in variable FNR stores the number of records read so far from the current input file. The built-in variable NR stores the total number of records read so far from all input files.
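The behaviour of FNR, NR, and RS can be sketched as follows. This is only an illustration, assuming a POSIX shell with mktemp and any POSIX awk (gawk can be substituted for awk); the file names a.txt and b.txt and the shell variable names are hypothetical.

```shell
# Two small throwaway files (hypothetical names a.txt and b.txt).
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/a.txt"
printf 'three\n'    > "$dir/b.txt"

# FNR restarts at 1 for each file; NR keeps counting across files.
counts=$(awk '{ print FNR, NR, $0 }' "$dir/a.txt" "$dir/b.txt")
echo "$counts"

# With RS set to ";" each semicolon-separated chunk becomes one record.
recs=$(printf 'a;b;c' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')
echo "$recs"

rm -r "$dir"
```

The first command prints '1 1 one', '2 2 two', '1 3 three'; the second prints '1: a', '2: b', '3: c'.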
3.2 Fields
gawk automatically divides each record into multiple fields. Similar to words in a
line, gawk's default behavior treats fields as being separated by whitespace. In
gawk, whitespace means one or more spaces or tabs.
In a gawk program, '$1' represents the first field, '$2' the second field,
and so on. For example, suppose one input line is as follows:
This seems like a pretty nice example.
The first field or $1 is 'This', the second field or $2 is 'seems', and so on.
One point deserves special attention: the seventh field or $7 is 'example.' rather than 'example'.
No matter how many fields there are, $NF can be used to represent the last field of a record. Using
the example above, $NF is the same as $7, namely 'example.'.
NF is a built-in variable whose value represents the number of fields in the current record. $0 may look like the zeroth field, but it is a special case; it represents the entire record.
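The field variables described above can be tried on the sample line directly. A minimal sketch, assuming any POSIX awk (gawk can be substituted for awk); $(NF-1), the next-to-last field, is added here for illustration and is not part of the text above.

```shell
fields=$(echo 'This seems like a pretty nice example.' |
  awk '{ print NF        # number of fields: 7
         print $1        # first field
         print $NF       # last field, including the trailing period
         print $(NF-1)   # next-to-last field
       }')
echo "$fields"
```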
Here is a somewhat more complex example:
gawk '$1~/foo/ {print $0}' BBS-list
The result is as follows:
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
macfoo 555-6480 1200/300 A
sabafoo 555-2127 1200/300 C
This example checks the first field of each record in the input file 'BBS-list'. If
it contains the substring 'foo', then that record is printed.
3.3 How records are divided into fields
gawk divides a record into fields according to the field separator. The field separator is represented by the built-in variable FS.
For example, if the field separator is 'oo', then the following line:
moo goo gai pan
will be divided into three fields: 'm', ' g', ' gai pan'.
In a gawk program, '=' can be used to change the value of FS. For example:
gawk 'BEGIN {FS=","}; {print $2}'
The input line is as follows:
John Q. Smith, 29 Oak St., Walamazoo, MI 42139
Executing gawk will print the string ' 29 Oak St.'. The action after BEGIN is executed once
before the first record is read.
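The 'oo' separator example from this section can be checked with a short loop over the fields. This is a sketch only, assuming any POSIX awk (gawk can be substituted for awk); the brackets are added to make the leading spaces visible.

```shell
parts=$(echo 'moo goo gai pan' |
  awk 'BEGIN { FS = "oo" }                       # set before the first record is read
       { for (i = 1; i <= NF; i++) print i ": [" $i "]" }')
echo "$parts"
```

It prints '1: [m]', '2: [ g]', and '3: [ gai pan]', one per line.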
Chapter 4 Printing
In gawk programs, the thing actions do most often is printing. For simple
printing, use the print statement. For complex formatted printing, use the printf statement.
4.1 The print statement
The print statement is used for simple, standard output format. The statement format is as follows:
print item1, item2, ...
When outputting, each item is separated by a space, and a newline is added at the end.
If nothing follows the 'print' statement, it has the same effect as 'print $0'; it prints the current record. To print a blank line, you can use 'print ""'. To print a fixed piece of text, you can enclose the text in double quotes, for example 'print "Hello there"'.
Here is an example that prints the first two fields of each input record:
gawk '{print $1,$2}' shipped
The result is as follows:
Jan 13
Feb 15
Mar 15
Apr 31
May 16
Jun 31
Jul 24
Aug 15
Sep 13
Oct 29
Nov 20
Dec 17
Jan 21
Feb 26
Mar 24
Apr 21
4.2 Output Separators
As mentioned earlier, when a print statement contains several items separated by commas, the items are printed with a space between them. Any string can be used as the output field separator; it is set through the built-in variable OFS. The initial value of OFS is " ", that is, one space.
The output of an entire print statement is called the output record. After printing the output record, print outputs a string called the output record separator, held in the built-in variable ORS. The initial value of ORS is "\n", that is, a newline.
The following example prints the first and second fields of each record. The two fields are separated by a semicolon ';', and a blank line is added after each output line.
gawk 'BEGIN {OFS=";"; ORS="\n\n"} {print $1, $2}' BBS-list
The result is as follows (the blank line that follows each output line is omitted here):
aardvark;555-5553
alpo-net;555-3412
barfly;555-7685
bites;555-1675
camelot;555-0542
core;555-2912
fooey;555-1234
foot;555-6699
macfoo;555-6480
sdace;555-3430
sabafoo;555-2127
4.3 The printf statement
The printf statement makes it easier to control the output format precisely. The printf statement can
specify the width of each printed item, and can also specify various numeric formats.
The format of the printf statement is:
printf format, item1, item2, ...
The difference between print and printf lies in the format argument: printf takes one more argument than print, the string format. Its form is the same as the format string of ANSI C's printf. printf does not automatically output a newline, and the built-in variables OFS and ORS have no effect on printf statements.
A format specification begins with the character '%', followed by a format control letter.
The format control letters are as follows:
'c' Print a number as an ASCII character.
For example, 'printf "%c",65' prints the character 'A'.
'd' Print a decimal integer.
'i' Print a decimal integer.
'e' Print a number in scientific notation.
For example
printf "%4.3e",1950
The result will print '1.950e+03'.
'f' Print a number in floating-point form.
'g' Print a number either in scientific notation or in floating-point form. Scientific notation is used when the exponent is less than -4 or not less than the precision; otherwise the number is printed in floating-point form.
'o' Print an unsigned octal integer.
's' Print a string.
'x' Print an unsigned hexadecimal integer. 10 through 15 are represented by 'a' through 'f'.
'X' Print an unsigned hexadecimal integer. 10 through 15 are represented by 'A' through 'F'.
'%' It is not really a format control letter; '%%' prints '%'.
A modifier can be added between % and the format control letter. A modifier is used to further control the output format. Possible modifiers are as follows:
'-' Used before width, indicating left alignment. If '-' does not appear, then it will be
right-aligned within the specified width. For example:
printf "%-4s", "foo"
will print 'foo '.
'width' This number indicates the width to be used when printing the corresponding field. For example:
printf "%4s","foo"
will print ' foo'.
The value of width is a minimum width, not a maximum width. If an item
requires more width than width, then it is not affected by width. For example
printf "%4s","foobar"
will print 'foobar'.
'.prec' This number specifies the precision when printing. For a number, it specifies the number of digits to the right of the decimal point. For a string, it specifies the maximum number of characters of the string that will be printed.
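The modifiers above can be combined in one format string. The following is a minimal sketch, assuming any POSIX awk (gawk can be substituted for awk); the brackets are added only to make the padding visible, and the sample values are arbitrary.

```shell
row=$(awk 'BEGIN {
  # left-aligned width, right-aligned width, numeric precision,
  # string precision (truncation), width combined with precision
  printf "[%-6s][%6s][%.2f][%.3s][%5.1f]\n", "foo", "foo", 3.14159, "foobar", 2.5
}')
echo "$row"   # [foo   ][   foo][3.14][foo][  2.5]
```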
Chapter 5 Patterns
In a gawk program, only when a pattern matches the current input record does its
corresponding action get executed.
5.1 Types of patterns
Here is a summary of the various forms of patterns in gawk:
/regular expression/
A regular expression used as a pattern. Whenever an input record contains text matching the regular expression, it is considered a match.
expression
A single expression. When a value is not 0, or a string is not empty,
it can be considered a match.
pat1,pat2
A pair of patterns separated by a comma, specifying a range of records.
BEGIN
END
These are special patterns; gawk will execute the actions corresponding
to BEGIN or END when starting execution or when finishing.
null
This is an empty pattern. It is considered to match every input record.
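The range form pat1,pat2 is the only pattern type above without an example elsewhere in this manual, so here is a minimal sketch. It assumes any POSIX awk (gawk can be substituted for awk) and feeds five records from 'shipped' in-line. The range starts at the record matching the first pattern and ends at the record matching the second, both included.

```shell
range=$(printf '%s\n' 'Mar 15 24 34 228' 'Apr 31 52 63 420' 'May 16 34 29 208' \
                      'Jun 31 42 75 492' 'Jul 24 34 67 436' |
  awk '$1 == "Apr", $1 == "Jun" { print $1 }')   # from the Apr record through the Jun record
echo "$range"
```

It prints Apr, May, and Jun, one per line.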
5.2 Regular Expressions as Patterns
A regular expression, abbreviated regexp, is a way of describing a string. A regular expression
enclosed in slashes ('/') serves as a gawk pattern.
If an input record contains the regexp, it is considered a match. For example, if the pattern is /foo/,
then any input record containing 'foo' is considered a match.
The following example prints the 2nd field of input records containing 'foo'.
gawk '/foo/ {print $2}' BBS-list
The result is as follows:
555-1234
555-6699
555-6480
555-2127
regexp can also be used in comparison expressions.
exp ~ /regexp/
If exp matches regexp, the result is true.
exp !~ /regexp/
If exp does not match regexp, the result is true.
5.3 Comparison Expressions as Patterns
Comparison patterns are used to test relationships between two numbers or strings such as greater than, equal to,
or less than. Some comparison patterns are listed below:
x<y If x is less than y, the result is true.
x<=y If x is less than or equal to y, the result is true.
x>y If x is greater than y, the result is true.
x>=y If x is greater than or equal to y, the result is true.
x==y If x is equal to y, the result is true.
x!=y If x is not equal to y, the result is true.
x~y If x matches regular expression y, the result is true.
x!~y If x does not match regular expression y, the result is true.
For x and y mentioned above, if both are numbers then it is treated as a numeric comparison;
otherwise they are converted to strings and compared as strings. Two strings are compared by
first comparing the first character, then the second character, and so on, until a difference
appears. If two strings are equal up to the end of the shorter one, then the longer
string is considered greater than the shorter one. For example, "10" is less than "9", and "abc" is less than "abcd".
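The difference between numeric and string comparison can be seen with constants. A minimal sketch, assuming any POSIX awk (gawk can be substituted for awk); a true comparison prints 1 and a false one prints 0.

```shell
cmp=$(awk 'BEGIN {
  print (10 < 9)          # numeric comparison: 0 (false)
  print ("10" < "9")      # string comparison: 1 (true), since "1" sorts before "9"
  print ("abc" < "abcd")  # equal up to the end of the shorter string: 1 (true)
}')
echo "$cmp"
```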
5.4 Patterns Using Boolean Operators
A boolean pattern combines other patterns using the boolean operators "or" ('||'), "and"
('&&'), and "not" ('!').
For example:
gawk '/2400/ && /foo/' BBS-list
gawk '/2400/ || /foo/' BBS-list
gawk '! /foo/' BBS-list
Chapter 6 Expressions as Actions
Expressions are the basic building blocks of actions in gawk programs.
6.1 Arithmetic operations
The arithmetic operations in gawk are as follows:
x+y addition
x-y subtraction
-x negative
+x positive. Actually it has no effect.
x*y multiplication
x/y division
x%y remainder. For example 5%3=2.
x^y
x**y x to the power y. For example 2^3=8.
6.2 Comparison Expressions and Boolean Expressions
A comparison expression is used to compare relationships
between strings or numbers; the operator symbols are the same as in the C language. They are listed below:
x<y
x<=y
x>y
x>=y
x==y
x!=y
x~y
x!~y
If the comparison result is true, its value is 1.
Otherwise its value is 0.
There are three kinds of boolean expressions:
boolean1 && boolean2
boolean1 || boolean2
! boolean
6.3 Conditional Expressions
A conditional expression is a special kind of expression that contains 3 operands.
Conditional expressions are the same as in the C language:
selector ? if-true-exp : if-false-exp
It has 3 subexpressions. The first subexpression, selector, is evaluated first. If it is true, then if-true-exp is evaluated and its value becomes the value of the whole expression. Otherwise if-false-exp is evaluated and its value becomes the value of the whole expression.
For example, the following expression produces the absolute value of x:
x>0 ? x : -x
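Applied to input, the absolute-value expression looks like this. A minimal sketch, assuming any POSIX awk (gawk can be substituted for awk); the three input numbers are arbitrary.

```shell
abs=$(printf '%s\n' 5 -3 12 |
  awk '{ print ($1 > 0 ? $1 : -$1) }')   # selector ? if-true-exp : if-false-exp
echo "$abs"
```

It prints 5, 3, and 12, one per line.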
Chapter 7 Control statements inside Actions
In gawk programs, control statements such as if and while control the flow
of program execution. The control statements in gawk are similar to those in C.
Many control statements include other statements; the included statements are called the body. If
the body includes more than one statement, these statements must be enclosed in braces { },
and the statements must be separated by newlines or semicolons.
7.1 The if statement
if (condition) then-body [else else-body]
If condition is true, then then-body is executed; otherwise else-body is executed. The else part is optional.
An example is as follows:
if (x % 2 == 0)
print "x is even"
else
print "x is odd"
7.2 The while statement
while (condition)
body
The first thing a while statement does is test condition. If condition is true, then
the body statement is executed. After the body statement has finished executing, condition is tested again. If
condition is true, then the body is executed again. This process is repeated until
condition is no longer true. If condition is false on the first test, then
the body is never executed.
The following example prints the first three fields of each input record.
gawk '{ i=1
while (i <= 3) {
print $i
i++
}
}'
7.3 The do-while statement
do
body
while (condition)
This do loop executes body once, and then repeats body as long as condition is true.
Even if condition is false at the start, body is still executed once.
The following example prints each input record ten times.
gawk '{ i= 1
do {
print $0
i++
} while (i <= 10)
}'
7.4 The for statement
for (initialization; condition; increment)
body
This statement executes initialization at the start, and then as long as condition is true, it
repeatedly executes body and performs increment.
The following example prints the first three fields of each input record.
gawk '{ for (i=1; i<=3; i++)
print $i
}'
7.5 The break statement
A break statement jumps out of the innermost enclosing for, while, or do-while loop.
The following example finds the smallest divisor of any integer, and also determines whether it is prime.
gawk '# find smallest divisor of num
{ num=$1
for (div=2; div*div <=num; div++)
if (num % div == 0)
break
if (num % div == 0)
printf "Smallest divisor of %d is %d\n", num, div
else
printf "%d is prime\n", num }'
7.6 The continue statement
The continue statement is used inside for, while, and do-while loops. It skips
the rest of the loop body, causing the next loop iteration to begin immediately.
The following example prints all the numbers from 0 to 20, but 5 will not be printed.
gawk 'BEGIN {
for (x=0; x<=20; x++) {
if (x==5)
continue
printf ("%d ",x)
}
print ""
}'
7.7 The next statement, next file statement, and exit statement
The next statement forces gawk to immediately stop processing the current record and continue with the next record.
The next file statement is similar to next, but it forces gawk to immediately stop processing the current data file and move on to the next one. Current gawk versions spell this statement as the single word nextfile.
The exit statement causes the gawk program to stop executing and exit. However, if an END rule is present, its actions are executed first.
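The effect of next and exit can be sketched with a few in-line records. This is only an illustration, assuming any POSIX awk (gawk can be substituted for awk); the marker word STOP and the variable names are made up for the example.

```shell
out=$(printf '%s\n' 'alpha' '' 'beta' 'STOP' 'gamma' |
  awk 'NF == 0      { next }             # skip blank records; later rules are not tried
       $1 == "STOP" { exit }             # stop reading input; the END rule still runs
                    { n++; print $1 }
       END          { print "printed " n " records" }')
echo "$out"
```

It prints alpha, beta, and then 'printed 2 records'; the record gamma is never read.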
Chapter 8 Built-in Functions
Built-in functions are functions built into gawk, and built-in
functions can be called anywhere in a gawk program.
8.1 Numeric built-in functions
int(x) gets the integer part of x, truncating toward 0. For example: int(3.9)
is 3, and int(-3.9) is -3.
sqrt(x) gets the positive square root of x. Example: sqrt(4)=2
exp(x) gets e to the power x. Example: exp(2) is e*e.
log(x) gets the natural logarithm of x.
sin(x) gets the sine value of x, where x is in radians.
cos(x) gets the cosine value of x, where x is in radians.
atan2(y,x) gets the arctangent value of y/x, and the resulting value is in radians.
rand() produces a random number. The number is uniformly distributed between 0 and 1; it may be 0, but it is never 1.
Each time gawk runs, rand starts producing numbers from the same starting point, or seed.
srand(x) sets the starting point, or seed, for generating random numbers to x. Setting the same seed value again produces the same sequence of random numbers again.
If the argument x is omitted, as in srand(), the current date and time are used as the seed. This makes the sequence differ from run to run.
The return value of srand is the previously set seed value.
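That a fixed seed reproduces the same sequence can be checked as follows. A minimal sketch, assuming a POSIX shell and any POSIX awk (gawk can be substituted for awk); the shell function draw and the seed 42 are hypothetical, and -v is the standard awk option for setting a variable from the command line. The actual numbers depend on the awk implementation, so none are shown.

```shell
# draw SEED: print three pseudo-random integers in the range 0..100 for the given seed.
draw() {
  awk -v seed="$1" 'BEGIN {
    srand(seed)
    for (i = 1; i <= 3; i++) printf "%d ", int(101 * rand())
    print ""
  }'
}
first=$(draw 42)
second=$(draw 42)
echo "$first"
echo "$second"   # same seed, same sequence as the line above
```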
8.2 String built-in functions
index(in, find)
It looks in the string in for the first occurrence of the string find, and the return value is
the position where string find appears in string in. If string find cannot be found in string in,
then the return value is 0.
For example:
print index("peanut","an")
will print 3.
length(string)
Gets how many characters string has.
For example:
length("abcde")
is 5.
match(string,regexp)
The match function looks in the string string for the longest, leftmost substring that matches regexp. The return value is the position in string where that substring starts, that is, its index, or 0 if there is no match.
The match function sets the built-in variable RSTART to this index, and the built-in variable RLENGTH to the number of matched characters. If there is no match, RSTART is set to 0 and RLENGTH to -1.
sprintf(format,expression1,...)
Similar to printf, but sprintf does not print; instead it returns a string.
For example:
sprintf("pi = %.2f (approx.)",22/7)
the returned string is "pi = 3.14 (approx.)"
sub(regexp, replacement,target)
In the string target, find the longest, leftmost place that matches regexp, and
replace the leftmost regexp with the string replacement.
For example:
str = "water, water, everywhere"
sub(/at/, "ith",str)
The resulting string str becomes
"wither, water, everywhere"
gsub(regexp, replacement, target)
gsub is similar to the previous sub. In the string target, find all places that match regexp,
and replace all regexp occurrences with the string replacement.
For example:
str="water, water, everywhere"
gsub(/at/, "ith",str)
The resulting string str becomes
"wither, wither, everywhere"
substr(string, start, length)
Returns a substring of string string. This substring has a length of length characters,
starting from position start.
For example:
substr("washington",5,3)
the return value is "ing"
If length does not appear, then the returned substring starts from position start
and continues to the end.
For example:
substr("washington",5)
the return value is "ington"
tolower(string)
Changes uppercase letters in string string to lowercase letters.
For example:
tolower("MiXeD cAsE 123")
the return value is "mixed case 123"
toupper(string)
Changes lowercase letters in string string to uppercase letters.
For example:
toupper("MiXeD cAsE 123")
the return value is "MIXED CASE 123"
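Several of the string functions above can be combined in one short program. A minimal sketch, assuming any POSIX awk (gawk can be substituted for awk); the variable names s and n are illustrative. It also relies on the fact that gsub returns the number of replacements it made.

```shell
demo=$(awk 'BEGIN {
  s = "water, water, everywhere"
  if (match(s, /w[a-z]+/))                 # leftmost, longest match is "water"
    print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
  n = gsub(/at/, "ith", s)                 # replaces every "at" in s; n is the count
  print n, s
  print toupper(substr(s, 1, 6)), index(s, "every"), length(s)
}')
echo "$demo"
```

It prints '1 5 water', then '2 wither, wither, everywhere', then 'WITHER 17 26'.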
8.3 Input/output built-in functions
close(filename)
Closes the input or output file filename.
system(command)
This function allows the user to execute operating system commands; after execution, it returns to the gawk
program.
For example:
BEGIN {system("ls")}
Chapter 9 User-defined Functions
Complex gawk programs can often be simplified by using user-defined
functions. Calling a user-defined function is the same as calling a built-in function.
9.1 Function definition format
A function definition can be placed anywhere in a gawk program.
The format of a user-defined function is as follows:
function name (parameter-list) {
body-of-function
}
name is the name of the defined function. A valid function name can include a sequence of letters, digits, and underscores, but it cannot begin with a digit.
parameter-list lists all the function's arguments, separated
from each other by commas.
body-of-function contains gawk statements. It is the most important part
of the function definition, and it determines what the function actually does.
9.2 An example of a function definition
The following example adds together the square of the value of the first field of each record and the square of the value of the second
field.
{print "sum =",SquareSum($1,$2)}
function SquareSum(x,y) {
sum=x*x+y*y
return sum
}
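In the definition above, sum is a global variable, because awk variables are global unless they appear in the parameter list. A common idiom is to declare an extra parameter and use it as a local variable. The following is a sketch of that variant, assuming any POSIX awk (gawk can be substituted for awk); the local name s and the two input records are made up for the example.

```shell
sq=$(printf '%s\n' '3 4' '1 2' |
  awk 'function SquareSum(x, y,    s) {   # s is an extra parameter used as a local
         s = x * x + y * y
         return s
       }
       { print "sum =", SquareSum($1, $2) }')
echo "$sq"
```

It prints 'sum = 25' and 'sum = 5'. The caller passes only two arguments, so s starts out empty on every call.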
Chapter 10 Examples
Some examples of gawk programs will be listed here.
gawk '{if (NF > max) max = NF}
END {print max}'
This program prints the maximum number of fields among all input lines.
gawk 'length($0) > 80'
This program prints every line that exceeds 80 characters. Here only the pattern is
listed; the action uses the default print.
gawk 'NF > 0'
This program prints every line that has at least one field. This is a simple way to delete all blank lines in a file.
gawk '{if (NF > 0) print}'
This program also prints every line that has at least one field, this time with an explicit if statement in the action.
gawk 'BEGIN {for (i = 1; i <= 7; i++)
print int(101 * rand())}'
This program prints 7 random numbers in the range from 0 to 100.
ls -l files | gawk '{x += $4}; END {print "total bytes: " x}'
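This program prints the total number of bytes of all specified files. It assumes the file size is the 4th field of the 'ls -l' output; on systems whose 'ls -l' also prints the group, the size is the 5th field and $5 should be used instead.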
expand file | gawk '{if (x < length()) x = length()}
END {print "maximum line length is " x}'
This program prints the length of the longest line in the specified file. expand changes tabs
into spaces, so comparison is done using the actual right margin length.
gawk 'BEGIN {FS = ":"}
{print $1 | "sort"}' /etc/passwd
This program prints all users' login names in alphabetical order
gawk '{nlines++}
END {print nlines}'
This program prints the total number of lines in a file.
gawk 'END {print NR}'
This program also prints the total number of lines in a file, but the work of counting lines is done by gawk.
gawk '{print NR,$0}'
When this program prints the contents of a file, it prints the line number at the very beginning of each line. Its function is similar to 'cat -n'.
Chapter 11 Conclusion
gawk has very powerful capabilities for data processing. It can accomplish what you want with very short programs; sometimes one or two lines of code are enough to complete the task. For the same piece of work, a gawk program is usually much shorter than one written in other programming languages.
gawk is GNU's awk. It is free software distributed under the GNU General Public License and may be used free of charge.
[ Last edited by 无奈何 on 2006-10-27 at 02:40 AM ]