\input texinfo   @c -*-texinfo-*-
@c $NetBSD: awk.texi,v 1.1 2010/12/13 06:21:53 mrg Exp $
@c %**start of header (This is for running Texinfo on a region.)
@setfilename awk.info
@settitle The GNU Awk User's Guide
@c %**end of header (This is for running Texinfo on a region.)

@dircategory Text creation and manipulation
@direntry
* Gawk: (awk).                  A text scanning and processing language.
@end direntry
@dircategory Individual utilities
@direntry
* awk: (awk)Invoking gawk.      Text scanning and processing.
@end direntry

@set xref-automatic-section-title

@c The following information should be updated here only!
@c This sets the edition of the document, the version of gawk it
@c applies to and all the info about who's publishing this edition

@c These apply across the board.
@set UPDATE-MONTH June, 2003
@set VERSION 3.1
@set PATCHLEVEL 3

@set FSF

@set TITLE GAWK: Effective AWK Programming
@set SUBTITLE A User's Guide for GNU Awk
@set EDITION 3

@iftex
@set DOCUMENT book
@set CHAPTER chapter
@set APPENDIX appendix
@set SECTION section
@set SUBSECTION subsection
@set DARKCORNER @inmargin{@image{lflashlight,1cm}, @image{rflashlight,1cm}}
@end iftex
@ifinfo
@set DOCUMENT Info file
@set CHAPTER major node
@set APPENDIX major node
@set SECTION minor node
@set SUBSECTION node
@set DARKCORNER (d.c.)
@end ifinfo
@ifhtml
@set DOCUMENT Web page
@set CHAPTER chapter
@set APPENDIX appendix
@set SECTION section
@set SUBSECTION subsection
@set DARKCORNER (d.c.)
@end ifhtml
@ifxml
@set DOCUMENT book
@set CHAPTER chapter
@set APPENDIX appendix
@set SECTION section
@set SUBSECTION subsection
@set DARKCORNER (d.c.)
@end ifxml

@c some special symbols
@iftex
@set LEQ @math{@leq}
@end iftex
@ifnottex
@set LEQ <=
@end ifnottex

@set FN file name
@set FFN File Name
@set DF data file
@set DDF Data File
@set PVERSION version
@set CTL Ctrl

@ignore
Some comments on the layout for TeX.
1. Use at least texinfo.tex 2000-09-06.09
2. I have done A LOT of work to make this look good. There are `@page' commands
   and use of `@group ... @end group' in a number of places. If you muck
   with anything, it's your responsibility not to break the layout.
@end ignore

@c merge the function and variable indexes into the concept index
@ifinfo
@synindex fn cp
@synindex vr cp
@end ifinfo
@iftex
@syncodeindex fn cp
@syncodeindex vr cp
@end iftex
@ifxml
@syncodeindex fn cp
@syncodeindex vr cp
@end ifxml

@c If "finalout" is commented out, the printed output will show
@c black boxes that mark lines that are too long. Thus, it is
@c unwise to comment it out when running a master in case there are
@c overfulls which are deemed okay.

@iftex
@finalout
@end iftex

@copying
Copyright @copyright{} 1989, 1991, 1992, 1993, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003 Free Software Foundation, Inc.
@sp 2

This is Edition @value{EDITION} of @cite{@value{TITLE}: @value{SUBTITLE}},
for the @value{VERSION}.@value{PATCHLEVEL} (or later) version of the GNU
implementation of AWK.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with the
Invariant Sections being ``GNU General Public License'', the Front-Cover
texts being (a) (see below), and with the Back-Cover Texts being (b)
(see below). A copy of the license is included in the section entitled
``GNU Free Documentation License''.

@enumerate a
@item
``A GNU Manual''

@item
``You have freedom to copy and modify this GNU Manual, like GNU
software. Copies published by the Free Software Foundation raise
funds for GNU development.''
@end enumerate
@end copying

@c Comment out the "smallbook" for technical review. Saves
@c considerable paper. Remember to turn it back on *before*
@c starting the page-breaking work.

@c 4/2002: Karl Berry recommends commenting out this and the
@c `@setchapternewpage odd', and letting users use `texi2dvi -t'
@c if they want to waste paper.
@c @smallbook


@c Uncomment this for the release. Leaving it off saves paper
@c during editing and review.
@c @setchapternewpage odd

@titlepage
@title @value{TITLE}
@subtitle @value{SUBTITLE}
@subtitle Edition @value{EDITION}
@subtitle @value{UPDATE-MONTH}
@author Arnold D. Robbins

@c Include the Distribution inside the titlepage environment so
@c that headings are turned off. Headings on and off do not work.

@page
@vskip 0pt plus 1filll
@ignore
The programs and applications presented in this book have been
included for their instructional value. They have been tested with care
but are not guaranteed for any particular purpose. The publisher does not
offer any warranties or representations, nor does it accept any
liabilities with respect to the programs or applications.
So there.
@sp 2
UNIX is a registered trademark of The Open Group in the United States and other countries. @*
Microsoft, MS and MS-DOS are registered trademarks, and Windows is a
trademark of Microsoft Corporation in the United States and other
countries. @*
Atari, 520ST, 1040ST, TT, STE, Mega and Falcon are registered trademarks
or trademarks of Atari Corporation. @*
DEC, Digital, OpenVMS, ULTRIX and VMS are trademarks of Digital Equipment
Corporation.
@*
@end ignore
``To boldly go where no man has gone before'' is a
Registered Trademark of Paramount Pictures Corporation. @*
@c sorry, i couldn't resist
@sp 3
Published by:
@sp 1

Free Software Foundation @*
59 Temple Place --- Suite 330 @*
Boston, MA 02111-1307 USA @*
Phone: +1-617-542-5942 @*
Fax: +1-617-542-2652 @*
Email: @email{gnu@@gnu.org} @*
URL: @uref{http://www.gnu.org/} @*

@c This one is correct for gawk 3.1.0 from the FSF
ISBN 1-882114-28-0 @*
@sp 2
@insertcopying
@sp 2
Cover art by Etienne Suvasa.
@end titlepage

@c Thanks to Bob Chassell for directions on doing dedications.
@iftex
@headings off
@page
@w{ }
@sp 9
@center @i{To Miriam, for making me complete.}
@sp 1
@center @i{To Chana, for the joy you bring us.}
@sp 1
@center @i{To Rivka, for the exponential increase.}
@sp 1
@center @i{To Nachum, for the added dimension.}
@sp 1
@center @i{To Malka, for the new beginning.}
@w{ }
@page
@w{ }
@page
@headings on
@end iftex

@iftex
@headings off
@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
@oddheading @| @| @strong{@thischapter}@ @ @ @thispage
@end iftex

@ifnottex
@ifnotxml
@node Top
@top General Introduction
@c Preface node should come right after the Top
@c node, in `unnumbered' sections, then the chapter, `What is gawk'.
@c Licensing nodes are appendices, they're not central to AWK.

This file documents @command{awk}, a program that you can use to select
particular records in a file and perform operations upon them.

@insertcopying

@end ifnotxml
@end ifnottex

@menu
* Foreword::                       Some nice words about this
                                   @value{DOCUMENT}.
* Preface::                        What this @value{DOCUMENT} is about; brief
                                   history and acknowledgments.
* Getting Started::                A basic introduction to using
                                   @command{awk}. How to run an @command{awk}
                                   program.
                                   Command-line syntax.
* Regexp::                         All about matching things using regular
                                   expressions.
* Reading Files::                  How to read files and manipulate fields.
* Printing::                       How to print using @command{awk}. Describes
                                   the @code{print} and @code{printf}
                                   statements. Also describes redirection of
                                   output.
* Expressions::                    Expressions are the basic building blocks
                                   of statements.
* Patterns and Actions::           Overviews of patterns and actions.
* Arrays::                         The description and use of arrays. Also
                                   includes array-oriented control statements.
* Functions::                      Built-in and user-defined functions.
* Internationalization::           Getting @command{gawk} to speak your
                                   language.
* Advanced Features::              Stuff for advanced users, specific to
                                   @command{gawk}.
* Invoking Gawk::                  How to run @command{gawk}.
* Library Functions::              A Library of @command{awk} Functions.
* Sample Programs::                Many @command{awk} programs with complete
                                   explanations.
* Language History::               The evolution of the @command{awk}
                                   language.
* Installation::                   Installing @command{gawk} under various
                                   operating systems.
* Notes::                          Notes about @command{gawk} extensions and
                                   possible future work.
* Basic Concepts::                 A very quick introduction to programming
                                   concepts.
* Glossary::                       An explanation of some unfamiliar terms.
* Copying::                        Your right to copy and distribute
                                   @command{gawk}.
* GNU Free Documentation License:: The license for this @value{DOCUMENT}.
* Index::                          Concept and Variable Index.

@detailmenu
* History::                        The history of @command{gawk} and
                                   @command{awk}.
* Names::                          What name to use to find @command{awk}.
* This Manual::                    Using this @value{DOCUMENT}. Includes
                                   sample input files that you can use.
* Conventions::                    Typographical Conventions.
* Manual History::                 Brief history of the GNU project and this
                                   @value{DOCUMENT}.
* How To Contribute::              Helping to save the world.
* Acknowledgments::                Acknowledgments.
* Running gawk::                   How to run @command{gawk} programs;
                                   includes command-line syntax.
* One-shot::                       Running a short throwaway @command{awk}
                                   program.
* Read Terminal::                  Using no input files (input from terminal
                                   instead).
* Long::                           Putting permanent @command{awk} programs in
                                   files.
* Executable Scripts::             Making self-contained @command{awk}
                                   programs.
* Comments::                       Adding documentation to @command{gawk}
                                   programs.
* Quoting::                        More discussion of shell quoting issues.
* Sample Data Files::              Sample data files for use in the
                                   @command{awk} programs illustrated in this
                                   @value{DOCUMENT}.
* Very Simple::                    A very simple example.
* Two Rules::                      A less simple one-line example using two
                                   rules.
* More Complex::                   A more complex example.
* Statements/Lines::               Subdividing or combining statements into
                                   lines.
* Other Features::                 Other Features of @command{awk}.
* When::                           When to use @command{gawk} and when to use
                                   other things.
* Regexp Usage::                   How to Use Regular Expressions.
* Escape Sequences::               How to write nonprinting characters.
* Regexp Operators::               Regular Expression Operators.
* Character Lists::                What can go between @samp{[...]}.
* GNU Regexp Operators::           Operators specific to GNU software.
* Case-sensitivity::               How to do case-insensitive matching.
* Leftmost Longest::               How much text matches.
* Computed Regexps::               Using Dynamic Regexps.
* Locales::                        How the locale affects things.
* Records::                        Controlling how data is split into records.
* Fields::                         An introduction to fields.
* Nonconstant Fields::             Nonconstant Field Numbers.
* Changing Fields::                Changing the Contents of a Field.
* Field Separators::               The field separator and how to change it.
* Regexp Field Splitting::         Using regexps as the field separator.
* Single Character Fields::        Making each character a separate field.
* Command Line Field Separator::   Setting @code{FS} from the command-line.
* Field Splitting Summary::        Some final points and a summary table.
* Constant Size::                  Reading constant width data.
* Multiple Line::                  Reading multi-line records.
* Getline::                        Reading files under explicit program
                                   control using the @code{getline} function.
* Plain Getline::                  Using @code{getline} with no arguments.
* Getline/Variable::               Using @code{getline} into a variable.
* Getline/File::                   Using @code{getline} from a file.
* Getline/Variable/File::          Using @code{getline} into a variable from a
                                   file.
* Getline/Pipe::                   Using @code{getline} from a pipe.
* Getline/Variable/Pipe::          Using @code{getline} into a variable from a
                                   pipe.
* Getline/Coprocess::              Using @code{getline} from a coprocess.
* Getline/Variable/Coprocess::     Using @code{getline} into a variable from a
                                   coprocess.
* Getline Notes::                  Important things to know about
                                   @code{getline}.
* Getline Summary::                Summary of @code{getline} Variants.
* Print::                          The @code{print} statement.
* Print Examples::                 Simple examples of @code{print} statements.
* Output Separators::              The output separators and how to change
                                   them.
* OFMT::                           Controlling Numeric Output With
                                   @code{print}.
* Printf::                         The @code{printf} statement.
* Basic Printf::                   Syntax of the @code{printf} statement.
* Control Letters::                Format-control letters.
* Format Modifiers::               Format-specification modifiers.
* Printf Examples::                Several examples.
* Redirection::                    How to redirect output to multiple files
                                   and pipes.
* Special Files::                  File name interpretation in @command{gawk}.
                                   @command{gawk} allows access to inherited
                                   file descriptors.
* Special FD::                     Special files for I/O.
* Special Process::                Special files for process information.
* Special Network::                Special files for network communications.
* Special Caveats::                Things to watch out for.
* Close Files And Pipes::          Closing Input and Output Files and Pipes.
* Constants::                      String, numeric and regexp constants.
* Scalar Constants::               Numeric and string constants.
* Nondecimal-numbers::             What are octal and hex numbers.
* Regexp Constants::               Regular Expression constants.
* Using Constant Regexps::         When and how to use a regexp constant.
* Variables::                      Variables give names to values for later
                                   use.
* Using Variables::                Using variables in your programs.
* Assignment Options::             Setting variables on the command-line and a
                                   summary of command-line syntax. This is an
                                   advanced method of input.
* Conversion::                     The conversion of strings to numbers and
                                   vice versa.
* Arithmetic Ops::                 Arithmetic operations (@samp{+}, @samp{-},
                                   etc.)
* Concatenation::                  Concatenating strings.
* Assignment Ops::                 Changing the value of a variable or a
                                   field.
* Increment Ops::                  Incrementing the numeric value of a
                                   variable.
* Truth Values::                   What is ``true'' and what is ``false''.
* Typing and Comparison::          How variables acquire types and how this
                                   affects comparison of numbers and strings
                                   with @samp{<}, etc.
* Boolean Ops::                    Combining comparison expressions using
                                   boolean operators @samp{||} (``or''),
                                   @samp{&&} (``and'') and @samp{!} (``not'').
* Conditional Exp::                Conditional expressions select between two
                                   subexpressions under control of a third
                                   subexpression.
* Function Calls::                 A function call is an expression.
* Precedence::                     How various operators nest.
* Pattern Overview::               What goes into a pattern.
* Regexp Patterns::                Using regexps as patterns.
* Expression Patterns::            Any expression can be used as a pattern.
* Ranges::                         Pairs of patterns specify record ranges.
* BEGIN/END::                      Specifying initialization and cleanup
                                   rules.
* Using BEGIN/END::                How and why to use BEGIN/END rules.
* I/O And BEGIN/END::              I/O issues in BEGIN/END rules.
* Empty::                          The empty pattern, which matches every
                                   record.
* Using Shell Variables::          How to use shell variables with
                                   @command{awk}.
* Action Overview::                What goes into an action.
* Statements::                     Describes the various control statements in
                                   detail.
* If Statement::                   Conditionally execute some @command{awk}
                                   statements.
* While Statement::                Loop until some condition is satisfied.
* Do Statement::                   Do specified action while looping until
                                   some condition is satisfied.
* For Statement::                  Another looping statement, that provides
                                   initialization and increment clauses.
* Switch Statement::               Switch/case evaluation for conditional
                                   execution of statements based on a value.
* Break Statement::                Immediately exit the innermost enclosing
                                   loop.
* Continue Statement::             Skip to the end of the innermost enclosing
                                   loop.
* Next Statement::                 Stop processing the current input record.
* Nextfile Statement::             Stop processing the current file.
* Exit Statement::                 Stop execution of @command{awk}.
* Built-in Variables::             Summarizes the built-in variables.
* User-modified::                  Built-in variables that you change to
                                   control @command{awk}.
* Auto-set::                       Built-in variables where @command{awk}
                                   gives you information.
* ARGC and ARGV::                  Ways to use @code{ARGC} and @code{ARGV}.
* Array Intro::                    Introduction to Arrays
* Reference to Elements::          How to examine one element of an array.
* Assigning Elements::             How to change an element of an array.
* Array Example::                  Basic Example of an Array
* Scanning an Array::              A variation of the @code{for} statement. It
                                   loops through the indices of an array's
                                   existing elements.
* Delete::                         The @code{delete} statement removes an
                                   element from an array.
* Numeric Array Subscripts::       How to use numbers as subscripts in
                                   @command{awk}.
* Uninitialized Subscripts::       Using Uninitialized variables as
                                   subscripts.
* Multi-dimensional::              Emulating multidimensional arrays in
                                   @command{awk}.
* Multi-scanning::                 Scanning multidimensional arrays.
* Array Sorting::                  Sorting array values and indices.
* Built-in::                       Summarizes the built-in functions.
* Calling Built-in::               How to call built-in functions.
* Numeric Functions::              Functions that work with numbers, including
                                   @code{int}, @code{sin} and @code{rand}.
* String Functions::               Functions for string manipulation, such as
                                   @code{split}, @code{match} and
                                   @code{sprintf}.
* Gory Details::                   More than you want to know about @samp{\}
                                   and @samp{&} with @code{sub}, @code{gsub},
                                   and @code{gensub}.
* I/O Functions::                  Functions for files and shell commands.
* Time Functions::                 Functions for dealing with timestamps.
* Bitwise Functions::              Functions for bitwise operations.
* I18N Functions::                 Functions for string translation.
* User-defined::                   Describes User-defined functions in detail.
* Definition Syntax::              How to write definitions and what they
                                   mean.
* Function Example::               An example function definition and what it
                                   does.
* Function Caveats::               Things to watch out for.
* Return Statement::               Specifying the value a function returns.
* Dynamic Typing::                 How variable types can change at runtime.
* I18N and L10N::                  Internationalization and Localization.
* Explaining gettext::             How GNU @code{gettext} works.
* Programmer i18n::                Features for the programmer.
* Translator i18n::                Features for the translator.
* String Extraction::              Extracting marked strings.
* Printf Ordering::                Rearranging @code{printf} arguments.
* I18N Portability::               @command{awk}-level portability issues.
* I18N Example::                   A simple i18n example.
* Gawk I18N::                      @command{gawk} is also internationalized.
* Nondecimal Data::                Allowing nondecimal input data.
* Two-way I/O::                    Two-way communications with another
                                   process.
* TCP/IP Networking::              Using @command{gawk} for network
                                   programming.
* Portal Files::                   Using @command{gawk} with BSD portals.
* Profiling::                      Profiling your @command{awk} programs.
* Command Line::                   How to run @command{awk}.
* Options::                        Command-line options and their meanings.
* Other Arguments::                Input file names and variable assignments.
* AWKPATH Variable::               Searching directories for @command{awk}
                                   programs.
* Obsolete::                       Obsolete Options and/or features.
* Undocumented::                   Undocumented Options and Features.
* Known Bugs::                     Known Bugs in @command{gawk}.
* Library Names::                  How to best name private global variables
                                   in library functions.
* General Functions::              Functions that are of general use.
* Nextfile Function::              Two implementations of a @code{nextfile}
                                   function.
* Assert Function::                A function for assertions in @command{awk}
                                   programs.
* Round Function::                 A function for rounding if @code{sprintf}
                                   does not do it correctly.
* Cliff Random Function::          The Cliff Random Number Generator.
* Ordinal Functions::              Functions for using characters as numbers
                                   and vice versa.
* Join Function::                  A function to join an array into a string.
* Gettimeofday Function::          A function to get formatted times.
* Data File Management::           Functions for managing command-line data
                                   files.
* Filetrans Function::             A function for handling data file
                                   transitions.
* Rewind Function::                A function for rereading the current file.
* File Checking::                  Checking that data files are readable.
* Empty Files::                    Checking for zero-length files.
* Ignoring Assigns::               Treating assignments as file names.
* Getopt Function::                A function for processing command-line
                                   arguments.
* Passwd Functions::               Functions for getting user information.
* Group Functions::                Functions for getting group information.
* Running Examples::               How to run these examples.
* Clones::                         Clones of common utilities.
* Cut Program::                    The @command{cut} utility.
* Egrep Program::                  The @command{egrep} utility.
* Id Program::                     The @command{id} utility.
* Split Program::                  The @command{split} utility.
* Tee Program::                    The @command{tee} utility.
* Uniq Program::                   The @command{uniq} utility.
* Wc Program::                     The @command{wc} utility.
* Miscellaneous Programs::         Some interesting @command{awk} programs.
* Dupword Program::                Finding duplicated words in a document.
* Alarm Program::                  An alarm clock.
* Translate Program::              A program similar to the @command{tr}
                                   utility.
* Labels Program::                 Printing mailing labels.
* Word Sorting::                   A program to produce a word usage count.
* History Sorting::                Eliminating duplicate entries from a
                                   history file.
* Extract Program::                Pulling out programs from Texinfo source
                                   files.
* Simple Sed::                     A Simple Stream Editor.
* Igawk Program::                  A wrapper for @command{awk} that includes
                                   files.
* V7/SVR3.1::                      The major changes between V7 and System V
                                   Release 3.1.
* SVR4::                           Minor changes between System V Releases 3.1
                                   and 4.
* POSIX::                          New features from the POSIX standard.
* BTL::                            New features from the Bell Laboratories
                                   version of @command{awk}.
* POSIX/GNU::                      The extensions in @command{gawk} not in
                                   POSIX @command{awk}.
* Contributors::                   The major contributors to @command{gawk}.
* Gawk Distribution::              What is in the @command{gawk} distribution.
* Getting::                        How to get the distribution.
* Extracting::                     How to extract the distribution.
* Distribution contents::          What is in the distribution.
* Unix Installation::              Installing @command{gawk} under various
                                   versions of Unix.
* Quick Installation::             Compiling @command{gawk} under Unix.
* Additional Configuration Options:: Other compile-time options.
* Configuration Philosophy::       How it's all supposed to work.
* Non-Unix Installation::          Installation on Other Operating Systems.
* Amiga Installation::             Installing @command{gawk} on an Amiga.
* BeOS Installation::              Installing @command{gawk} on BeOS.
* PC Installation::                Installing and Compiling @command{gawk} on
                                   MS-DOS and OS/2.
* PC Binary Installation::         Installing a prepared distribution.
* PC Compiling::                   Compiling @command{gawk} for MS-DOS, Windows32,
                                   and OS/2.
* PC Using::                       Running @command{gawk} on MS-DOS, Windows32 and
                                   OS/2.
* PC Dynamic::                     Compiling @command{gawk} for dynamic
                                   libraries.
* Cygwin::                         Building and running @command{gawk} for
                                   Cygwin.
* VMS Installation::               Installing @command{gawk} on VMS.
* VMS Compilation::                How to compile @command{gawk} under VMS.
* VMS Installation Details::       How to install @command{gawk} under VMS.
* VMS Running::                    How to run @command{gawk} under VMS.
* VMS POSIX::                      Alternate instructions for VMS POSIX.
* Unsupported::                    Systems whose ports are no longer
                                   supported.
* Atari Installation::             Installing @command{gawk} on the Atari ST.
* Atari Compiling::                Compiling @command{gawk} on Atari.
* Atari Using::                    Running @command{gawk} on Atari.
* Tandem Installation::            Installing @command{gawk} on a Tandem.
* Bugs::                           Reporting Problems and Bugs.
* Other Versions::                 Other freely available @command{awk}
                                   implementations.
* Compatibility Mode::             How to disable certain @command{gawk}
                                   extensions.
* Additions::                      Making Additions To @command{gawk}.
* Adding Code::                    Adding code to the main body of
                                   @command{gawk}.
* New Ports::                      Porting @command{gawk} to a new operating
                                   system.
* Dynamic Extensions::             Adding new built-in functions to
                                   @command{gawk}.
* Internals::                      A brief look at some @command{gawk}
                                   internals.
* Sample Library::                 An example of new functions.
* Internal File Description::      What the new functions will do.
* Internal File Ops::              The code for internal file operations.
* Using Internal File Ops::        How to use an external extension.
* Future Extensions::              New features that may be implemented one
                                   day.
* Basic High Level::               The high level view.
* Basic Data Typing::              A very quick intro to data types.
* Floating Point Issues::          Stuff to know about floating-point numbers.
@end detailmenu
@end menu

@c dedication for Info file
@ifinfo
@center To Miriam, for making me complete.
@sp 1
@center To Chana, for the joy you bring us.
@sp 1
@center To Rivka, for the exponential increase.
@sp 1
@center To Nachum, for the added dimension.
@sp 1
@center To Malka, for the new beginning.
@end ifinfo

@summarycontents
@contents

@node Foreword
@unnumbered Foreword

Arnold Robbins and I are good friends. We were introduced 11 years ago
by circumstances---and our favorite programming language, AWK.
The circumstances started a couple of years
earlier. I was working at a new job and noticed an unplugged
Unix computer sitting in the corner. No one knew how to use it,
and neither did I. However,
a couple of days later it was running, and
I was @code{root} and the one-and-only user.
That day, I began the transition from statistician to Unix programmer.

On one of many trips to the library or bookstore in search of
books on Unix, I found the gray AWK book, a.k.a. Aho, Kernighan and
Weinberger, @cite{The AWK Programming Language}, Addison-Wesley,
1988. AWK's simple programming paradigm---find a pattern in the
input and then perform an action---often reduced complex or tedious
data manipulations to a few lines of code. I was excited to try my
hand at programming in AWK.

Alas, the @command{awk} on my computer was a limited version of the
language described in the AWK book. I discovered that my computer
had ``old @command{awk}'' and the AWK book described ``new @command{awk}.''
I learned that this was typical; the old version refused to step
aside or relinquish its name. If a system had a new @command{awk}, it was
invariably called @command{nawk}, and few systems had it.
The best way to get a new @command{awk} was to @command{ftp} the source code for
@command{gawk} from @code{prep.ai.mit.edu}.
@command{gawk} was a version of
new @command{awk} written by David Trueman and Arnold, and available under
the GNU General Public License.

(Incidentally,
it's no longer difficult to find a new @command{awk}. @command{gawk} ships with
Linux, and you can download binaries or source code for almost
any system; my wife uses @command{gawk} on her VMS box.)

My Unix system started out unplugged from the wall; it certainly was not
plugged into a network. So, oblivious to the existence of @command{gawk}
and the Unix community in general, and desiring a new @command{awk}, I wrote
my own, called @command{mawk}.
Before I was finished I knew about @command{gawk},
but it was too late to stop, so I eventually posted
to a @code{comp.sources} newsgroup.

A few days after my posting, I got a friendly email
from Arnold introducing
himself. He suggested we share design and algorithms and
attached a draft of the POSIX standard so
that I could update @command{mawk} to support language extensions added
after publication of the AWK book.

Frankly, if our roles had
been reversed, I would not have been so open and we probably would
have never met. I'm glad we did meet.
He is an AWK expert's AWK expert and a genuinely nice person.
Arnold contributes significant amounts of his
expertise and time to the Free Software Foundation.

This book is the @command{gawk} reference manual, but at its core it
is a book about AWK programming that
will appeal to a wide audience.
It is a definitive reference to the AWK language as defined by the
1987 Bell Labs release and codified in the 1992 POSIX Utilities
standard.

On the other hand, the novice AWK programmer can study
a wealth of practical programs that emphasize
the power of AWK's basic idioms:
data-driven control flow, pattern matching with regular expressions,
and associative arrays.
Those looking for something new can try out @command{gawk}'s
interface to network protocols via special @file{/inet} files.

The programs in this book make clear that an AWK program is
typically much smaller and faster to develop than
a counterpart written in C.
Consequently, there is often a payoff to prototyping an
algorithm or design in AWK to get it running quickly and expose
problems early. Often, the interpreted performance is adequate
and the AWK prototype becomes the product.

The new @command{pgawk} (profiling @command{gawk}) produces
program execution counts.
I recently experimented with an algorithm that, for
@math{n} lines of input, exhibited
@tex
$\sim\! Cn^2$
@end tex
@ifnottex
~ C n^2
@end ifnottex
performance, while
theory predicted
@tex
$\sim\! Cn\log n$
@end tex
@ifnottex
~ C n log n
@end ifnottex
behavior. A few minutes poring
over the @file{awkprof.out} profile pinpointed the problem to
a single line of code. @command{pgawk} is a welcome addition to
my programmer's toolbox.

Arnold has distilled over a decade of experience writing and
using AWK programs, and developing @command{gawk}, into this book. If you use
AWK or want to learn how, then read this book.

@display
Michael Brennan
Author of @command{mawk}
@end display

@node Preface
@unnumbered Preface
@c I saw a comment somewhere that the preface should describe the book itself,
@c and the introduction should describe what the book covers.
@c
@c 12/2000: Chuck wants the preface & intro combined.

Several kinds of tasks occur repeatedly
when working with text files.
You might want to extract certain lines and discard the rest.
Or you may need to make changes wherever certain patterns appear,
but leave the rest of the file alone.
Writing single-use programs for these tasks in languages such as C, C++, or Pascal
is time-consuming and inconvenient.
Such jobs are often easier with @command{awk}.
The @command{awk} utility interprets a special-purpose programming language
that makes it easy to handle simple data-reformatting jobs.

The GNU implementation of @command{awk} is called @command{gawk}; it is fully
compatible with the System V Release 4 version of
@command{awk}. @command{gawk} is also compatible with the POSIX
specification of the @command{awk} language. This means that all
properly written @command{awk} programs should work with @command{gawk}.
Thus, we usually don't distinguish between @command{gawk} and other
@command{awk} implementations.

@cindex @command{awk}, POSIX and, See Also POSIX @command{awk}
@cindex @command{awk}, POSIX and
@cindex POSIX, @command{awk} and
@cindex @command{gawk}, @command{awk} and
@cindex @command{awk}, @command{gawk} and
@cindex @command{awk}, uses for
Using @command{awk} allows you to:

@itemize @bullet
@item
Manage small, personal databases

@item
Generate reports

@item
Validate data

@item
Produce indexes and perform other document preparation tasks

@item
Experiment with algorithms that you can adapt later to other computer
languages
@end itemize

@cindex @command{awk}, See Also @command{gawk}
@cindex @command{gawk}, See Also @command{awk}
@cindex @command{gawk}, uses for
In addition,
@command{gawk}
provides facilities that make it easy to:

@itemize @bullet
@item
Extract bits and pieces of data for processing

@item
Sort data

@item
Perform simple network communications
@end itemize

This @value{DOCUMENT} teaches you about the @command{awk} language and
how you can use it effectively.
You should already be familiar with basic
system commands, such as @command{cat} and @command{ls},@footnote{These commands
are available on POSIX-compliant systems, as well as on traditional
Unix-based systems. If you are using some other operating system, you still need to
be familiar with the ideas of I/O redirection and pipes.} as well as basic shell
facilities, such as input/output (I/O) redirection and pipes.

@cindex GNU @command{awk}, See @command{gawk}
Implementations of the @command{awk} language are available for many
different computing environments. This @value{DOCUMENT}, while describing
the @command{awk} language in general, also describes the particular
implementation of @command{awk} called @command{gawk} (which stands for
``GNU awk''). @command{gawk} runs on a broad range of Unix systems,
ranging from 80386 PC-based computers up through large-scale systems,
such as Crays. @command{gawk} has also been ported to Mac OS X,
MS-DOS, Microsoft Windows (all versions) and OS/2 PCs, Atari and Amiga
microcomputers, BeOS, Tandem D20, and VMS.

@menu
* History::                     The history of @command{gawk} and
                                @command{awk}.
* Names::                       What name to use to find @command{awk}.
* This Manual::                 Using this @value{DOCUMENT}. Includes sample
                                input files that you can use.
* Conventions::                 Typographical Conventions.
* Manual History::              Brief history of the GNU project and this
                                @value{DOCUMENT}.
* How To Contribute::           Helping to save the world.
* Acknowledgments::             Acknowledgments.
@end menu

@node History
@unnumberedsec History of @command{awk} and @command{gawk}
@cindex recipe for a programming language
@cindex programming language, recipe for
@center Recipe For A Programming Language

@multitable {2 parts} {1 part @code{egrep}} {1 part @code{snobol}}
@item @tab 1 part @code{egrep} @tab 1 part @code{snobol}
@item @tab 2 parts @code{ed} @tab 3 parts C
@end multitable

@quotation
Blend all parts well using @code{lex} and @code{yacc}.
Document minimally and release.

After eight years, add another part @code{egrep} and two
more parts C. Document very well and release.
@end quotation

@cindex Aho, Alfred
@cindex Weinberger, Peter
@cindex Kernighan, Brian
@cindex @command{awk}, history of
The name @command{awk} comes from the initials of its designers: Alfred V.@:
Aho, Peter J.@: Weinberger and Brian W.@: Kernighan. The original version of
@command{awk} was written in 1977 at AT&T Bell Laboratories.
In 1985, a new version made the programming
language more powerful, introducing user-defined functions, multiple input
streams, and computed regular expressions.
This new version became widely available with Unix System V
Release 3.1 (SVR3.1).
The version in SVR4 added some new features and cleaned
up the behavior in some of the ``dark corners'' of the language.
The specification for @command{awk} in the POSIX Command Language
and Utilities standard further clarified the language.
Both the @command{gawk} designers and the original Bell Laboratories @command{awk}
designers provided feedback for the POSIX specification.

@cindex Rubin, Paul
@cindex Fenlason, Jay
@cindex Trueman, David
Paul Rubin wrote the GNU implementation, @command{gawk}, in 1986.
Jay Fenlason completed it, with advice from Richard Stallman. John Woods
contributed parts of the code as well.
In 1988 and 1989, David Trueman, with
help from me, thoroughly reworked @command{gawk} for compatibility
with the newer @command{awk}.
Circa 1995, I became the primary maintainer.
Current development focuses on bug fixes,
performance improvements, standards compliance, and occasionally, new features.

In May of 1997, J@"urgen Kahrs felt the need for network access
from @command{awk}, and with a little help from me, set about adding
features to do this for @command{gawk}. At that time, he also
wrote the bulk of
@cite{TCP/IP Internetworking with @command{gawk}}
(a separate document, available as part of the @command{gawk} distribution).
His code finally became part of the main @command{gawk} distribution
with @command{gawk} @value{PVERSION} 3.1.

@xref{Contributors},
for a complete list of those who made important contributions to @command{gawk}.

@node Names
@section A Rose by Any Other Name

@cindex @command{awk}, new vs. old
The @command{awk} language has evolved over the years. Full details are
provided in @ref{Language History}.
The language described in this @value{DOCUMENT}
is often referred to as ``new @command{awk}'' (@command{nawk}).

@cindex @command{awk}, versions of
Because of this, many systems have multiple
versions of @command{awk}.
Some systems have an @command{awk} utility that implements the
original version of the @command{awk} language and a @command{nawk} utility
for the new
version.
Others have an @command{oawk} version for the ``old @command{awk}''
language and plain @command{awk} for the new one.
Still others only
have one version, which is usually the new one.@footnote{Often, these systems
use @command{gawk} for their @command{awk} implementation!}

@cindex @command{nawk} utility
@cindex @command{oawk} utility
All in all, this makes it difficult for you to know which version of
@command{awk} you should run when writing your programs. The best advice
I can give here is to check your local documentation. Look for @command{awk},
@command{oawk}, and @command{nawk}, as well as for @command{gawk}.
It is likely that you already
have some version of new @command{awk} on your system, which is what
you should use when running your programs. (Of course, if you're reading
this @value{DOCUMENT}, chances are good that you have @command{gawk}!)

Throughout this @value{DOCUMENT}, whenever we refer to a language feature
that should be available in any complete implementation of POSIX @command{awk},
we simply use the term @command{awk}. When referring to a feature that is
specific to the GNU implementation, we use the term @command{gawk}.

@node This Manual
@section Using This Book
@cindex @command{awk}, terms describing

The term @command{awk} refers to a particular program as well as to the language you
use to tell this program what to do. When we need to be careful, we call
the language ``the @command{awk} language,''
and the program ``the @command{awk} utility.''
This @value{DOCUMENT} explains
both the @command{awk} language and how to run the @command{awk} utility.
The term @dfn{@command{awk} program} refers to a program written by you in
the @command{awk} programming language.

@cindex @command{gawk}, @command{awk} and
@cindex @command{awk}, @command{gawk} and
@cindex POSIX @command{awk}
Primarily, this @value{DOCUMENT} explains the features of @command{awk},
as defined in the POSIX standard. It does so in the context of the
@command{gawk} implementation.
While doing so, it also
attempts to describe important differences between @command{gawk}
and other @command{awk} implementations.@footnote{All such differences
appear in the index under the
entry ``differences in @command{awk} and @command{gawk}.''}
Finally, any @command{gawk} features that are not in
the POSIX standard for @command{awk} are noted.

@ifnotinfo
This @value{DOCUMENT} has the difficult task of being both a tutorial and a reference.
If you are a novice, feel free to skip over details that seem too complex.
You should also ignore the many cross-references; they are for the
expert user and for the online Info version of the document.
@end ifnotinfo

There are
subsections labelled
as @strong{Advanced Notes}
scattered throughout the @value{DOCUMENT}.
They add a more complete explanation of points that are relevant, but not likely
to be of interest on first reading.
All appear in the index, under the heading ``advanced features.''

Most of the time, the examples use complete @command{awk} programs.
In some of the more advanced sections, only the part of the @command{awk}
program that illustrates the concept currently being described is shown.

While this @value{DOCUMENT} is aimed principally at people who have not been
exposed
to @command{awk}, there is a lot of information here that even the @command{awk}
expert should find useful. In particular, the description of POSIX
@command{awk} and the example programs in
@ref{Library Functions}, and in
@ref{Sample Programs},
should be of interest.

@ref{Getting Started},
provides the essentials you need to know to begin using @command{awk}.

@ref{Regexp},
introduces regular expressions in general, and in particular the flavors
supported by POSIX @command{awk} and @command{gawk}.

@ref{Reading Files},
describes how @command{awk} reads your data.
It introduces the concepts of records and fields, as well
as the @code{getline} command.
I/O redirection is first described here.

@ref{Printing},
describes how @command{awk} programs can produce output with
@code{print} and @code{printf}.

@ref{Expressions},
describes expressions, which are the basic building blocks
for getting most things done in a program.

@ref{Patterns and Actions},
describes how to write patterns for matching records, actions for
doing something when a record is matched, and the built-in variables
@command{awk} and @command{gawk} use.

@ref{Arrays},
covers @command{awk}'s one-and-only data structure: associative arrays.
Deleting array elements and whole arrays is also described, as well as
sorting arrays in @command{gawk}.

@ref{Functions},
describes the built-in functions @command{awk} and
@command{gawk} provide, as well as how to define
your own functions.

@ref{Internationalization},
describes special features in @command{gawk} for translating program
messages into different languages at runtime.

@ref{Advanced Features},
describes a number of @command{gawk}-specific advanced features.
Of particular note
are the abilities to have two-way communications with another process,
perform TCP/IP networking, and
profile your @command{awk} programs.

@ref{Invoking Gawk},
describes how to run @command{gawk}, the meaning of its
command-line options, and how it finds @command{awk}
program source files.

@ref{Library Functions}, and
@ref{Sample Programs},
provide many sample @command{awk} programs.
Reading them allows you to see @command{awk}
solving real problems.

@ref{Language History},
describes how the @command{awk} language has evolved from
its first release to the present. It also describes how @command{gawk}
has acquired features over time.

@ref{Installation},
describes how to get @command{gawk}, how to compile it
under Unix, and how to compile and use it on different
non-Unix systems. It also describes how to report bugs
in @command{gawk} and where to get three other freely
available implementations of @command{awk}.

@ref{Notes},
describes how to disable @command{gawk}'s extensions, as
well as how to contribute new code to @command{gawk},
how to write extension libraries, and some possible
future directions for @command{gawk} development.

@ref{Basic Concepts},
provides some very cursory background material for those who
are completely unfamiliar with computer programming.
Also centralized there is a discussion of some of the issues
surrounding floating-point numbers.

The
@ref{Glossary},
defines most, if not all, the significant terms used
throughout the book.
If you find terms that you aren't familiar with, try looking them up here.

@ref{Copying}, and
@ref{GNU Free Documentation License},
present the licenses that cover the @command{gawk} source code
and this @value{DOCUMENT}, respectively.

@node Conventions
@section Typographical Conventions

@cindex Texinfo
This @value{DOCUMENT} is written using Texinfo, the GNU documentation
formatting language.
A single Texinfo source file is used to produce both the printed and online
versions of the documentation.
@ifnotinfo
Because of this, the typographical conventions
are slightly different from those in other books you may have read.
@end ifnotinfo
@ifinfo
This @value{SECTION} briefly documents the typographical conventions used in Texinfo.
@end ifinfo

Examples you would type at the command line are preceded by the common
shell primary and secondary prompts, @samp{$} and @samp{>}.
Output from the command is preceded by the glyph ``@print{}''.
This typically represents the command's standard output.
Error messages, and other output on the command's standard error, are preceded
by the glyph ``@error{}''. For example:

@example
$ echo hi on stdout
@print{} hi on stdout
$ echo hello on stderr 1>&2
@error{} hello on stderr
@end example

@ifnotinfo
In the text, command names appear in @code{this font}, while code segments
appear in the same font and quoted, @samp{like this}. Some things are
emphasized @emph{like this}, and if a point needs to be made
strongly, it is done @strong{like this}. The first occurrence of
a new term is usually its @dfn{definition} and appears in the same
font as the previous occurrence of ``definition'' in this sentence.
@value{FN}s are indicated like this: @file{/path/to/ourfile}.
@end ifnotinfo

Characters that you type at the keyboard look @kbd{like this}. In particular,
there are special characters called ``control characters.'' These are
characters that you type by holding down both the @kbd{CONTROL} key and
another key, at the same time. For example, a @kbd{@value{CTL}-d} is typed
by first pressing and holding the @kbd{CONTROL} key, next
pressing the @kbd{d} key and finally releasing both keys.

@c fakenode --- for prepinfo
@subsubheading Dark Corners
@cindex Kernighan, Brian
@quotation
@i{Dark corners are basically fractal --- no matter how much
you illuminate, there's always a smaller but darker one.}@*
Brian Kernighan
@end quotation

@cindex d.c., See dark corner
@cindex dark corner
Until the POSIX standard (and @cite{The Gawk Manual}),
many features of @command{awk} were either poorly documented or not
documented at all. Descriptions of such features
(often called ``dark corners'') are noted in this @value{DOCUMENT} with
@iftex
the picture of a flashlight in the margin, as shown here.
@value{DARKCORNER}
@end iftex
@ifnottex
``(d.c.)''.
@end ifnottex
They also appear in the index under the heading ``dark corner.''

As noted by the opening quote, though, any
coverage of dark corners
is, by definition, incomplete.

@node Manual History
@unnumberedsec The GNU Project and This Book

@cindex FSF (Free Software Foundation)
@cindex Free Software Foundation (FSF)
@cindex Stallman, Richard
The Free Software Foundation (FSF) is a nonprofit organization dedicated
to the production and distribution of freely distributable software.
It was founded by Richard M.@: Stallman, the author of the original
Emacs editor. GNU Emacs is the most widely used version of Emacs today.

@cindex GNU Project
@cindex GPL (General Public License)
@cindex General Public License, See GPL
@cindex documentation, online
The GNU@footnote{GNU stands for ``GNU's not Unix.''}
Project is an ongoing effort on the part of the Free Software
Foundation to create a complete, freely distributable, POSIX-compliant
computing environment.
The FSF uses the ``GNU General Public License'' (GPL) to ensure that
its software's
source code is always available to the end user. A
copy of the GPL is included
@ifnotinfo
in this @value{DOCUMENT}
@end ifnotinfo
for your reference
(@pxref{Copying}).
The GPL applies to the C language source code for @command{gawk}.
To find out more about the FSF and the GNU Project online,
see @uref{http://www.gnu.org, the GNU Project's home page}.
This @value{DOCUMENT} may also be read from
@uref{http://www.gnu.org/manual/gawk/, their web site}.

A shell, an editor (Emacs), highly portable optimizing C, C++, and
Objective-C compilers, a symbolic debugger and dozens of large and
small utilities (such as @command{gawk}) have all been completed and are
freely available. The GNU operating
system kernel (the HURD) has been released but is still in an early
stage of development.

@cindex Linux
@cindex GNU/Linux
@cindex operating systems, BSD-based
@cindex Alpha (DEC)
Until the GNU operating system is more fully developed, you should
consider using GNU/Linux, a freely distributable, Unix-like operating
system for Intel 80386, DEC Alpha, Sun SPARC, IBM S/390, and other
systems.@footnote{The terminology ``GNU/Linux'' is explained
in the @ref{Glossary}.}
There are
many books on GNU/Linux. One that is freely available is @cite{Linux
Installation and Getting Started}, by Matt Welsh.
GNU/Linux distributions are often available in computer stores or
bundled on CD-ROMs with books about Linux.
(There are three other freely available, Unix-like operating systems for
80386 and other systems: NetBSD, FreeBSD, and OpenBSD. All are based on the
4.4-Lite Berkeley Software Distribution, and they use recent versions
of @command{gawk} for their versions of @command{awk}.)

@ifnotinfo
The @value{DOCUMENT} you are reading is actually free---at least, the
information in it is free to anyone. The machine-readable
source code for the @value{DOCUMENT} comes with @command{gawk}; anyone
may take this @value{DOCUMENT} to a copying machine and make as many
copies as they like. (Take a moment to check the Free Documentation
License in @ref{GNU Free Documentation License}.)

Although you could just print it out yourself, bound books are much
easier to read and use. Furthermore,
the proceeds from sales of this book go back to the FSF
to help fund development of more free software.
@end ifnotinfo

@ignore
@cindex Close, Diane
The @value{DOCUMENT} itself has gone through several previous,
preliminary editions.
Paul Rubin wrote the very first draft of @cite{The GAWK Manual};
it was around 40 pages in size.
Diane Close and Richard Stallman improved it, yielding the
version which I started working with in the fall of 1988.
It was around 90 pages long and barely described the original, ``old''
version of @command{awk}. After substantial revision, the first version of
the @cite{The GAWK Manual} to be released was Edition 0.11 Beta in
October of 1989. The manual then underwent more substantial revision
for Edition 0.13 of December 1991.
David Trueman, Pat Rankin and Michal Jaegermann contributed sections
of the manual for Edition 0.13.
That edition was published by the
FSF as a bound book early in 1992. Since then there were several
minor revisions, notably Edition 0.14 of November 1992 that was published
by the FSF in January of 1993 and Edition 0.16 of August 1993.

Edition 1.0 of @cite{GAWK: The GNU Awk User's Guide} represented a significant re-working
of @cite{The GAWK Manual}, with much additional material.
The FSF and I agreed that I was now the primary author.
@c I also felt that the manual needed a more descriptive title.

In January 1996, SSC published Edition 1.0 under the title @cite{Effective AWK Programming}.
In February 1997, they published Edition 1.0.3 which had minor changes
as a ``second edition.''
In 1999, the FSF published this same version as Edition 2
of @cite{GAWK: The GNU Awk User's Guide}.

Edition @value{EDITION} maintains the basic structure of Edition 1.0,
but with significant additional material, reflecting the host of new features
in @command{gawk} @value{PVERSION} @value{VERSION}.
Of particular note is
@ref{Array Sorting},
@ref{Bitwise Functions},
@ref{Internationalization},
@ref{Advanced Features},
and
@ref{Dynamic Extensions}.
@end ignore

@cindex Close, Diane
The @value{DOCUMENT} itself has gone through a number of previous editions.
Paul Rubin wrote the very first draft of @cite{The GAWK Manual};
it was around 40 pages in size.
Diane Close and Richard Stallman improved it, yielding a
version that was
around 90 pages long and barely described the original, ``old''
version of @command{awk}.

I started working with that version in the fall of 1988.
As work on it progressed,
the FSF published several preliminary versions (numbered 0.@var{x}).
In 1996, Edition 1.0 was released with @command{gawk} 3.0.0.
The FSF published the first two editions under
the title @cite{The GNU Awk User's Guide}.

This edition maintains the basic structure of Edition 1.0,
but with significant additional material, reflecting the host of new features
in @command{gawk} @value{PVERSION} @value{VERSION}.
Of particular note are
@ref{Array Sorting},
@ref{Bitwise Functions},
@ref{Internationalization},
@ref{Advanced Features},
and
@ref{Dynamic Extensions}.

@cite{@value{TITLE}} will undoubtedly continue to evolve.
An electronic version
comes with the @command{gawk} distribution from the FSF.
If you find an error in this @value{DOCUMENT}, please report it!
@xref{Bugs}, for information on submitting
problem reports electronically, or write to me in care of the publisher.

@node How To Contribute
@unnumberedsec How to Contribute

As the maintainer of GNU @command{awk},
I am starting a collection of publicly available @command{awk}
programs.
For more information,
see @uref{ftp://ftp.freefriends.org/arnold/Awkstuff}.
If you have written an interesting @command{awk} program, or have written a
@command{gawk} extension that you would like to
share with the rest of the world, please contact me (@email{arnold@@gnu.org}).
Making things available on the Internet helps keep the
@command{gawk} distribution down to a manageable size.

@node Acknowledgments
@unnumberedsec Acknowledgments

The initial draft of @cite{The GAWK Manual} had the following acknowledgments:

@quotation
Many people need to be thanked for their assistance in producing this
manual. Jay Fenlason contributed many ideas and sample programs. Richard
Mlynarik and Robert Chassell gave helpful comments on drafts of this
manual. The paper @cite{A Supplemental Document for @command{awk}} by John W.@:
Pierce of the Chemistry Department at UC San Diego, pinpointed several
issues relevant both to @command{awk} implementation and to this manual, that
would otherwise have escaped us.
@end quotation

@cindex Stallman, Richard
I would like to acknowledge Richard M.@: Stallman, for his vision of a
better world and for his courage in founding the FSF and starting the
GNU Project.

The following people (in alphabetical order)
provided helpful comments on various
versions of this book, up to and including this edition:
Rick Adams,
Nelson H.F. Beebe,
Karl Berry,
Dr.@: Michael Brennan,
Rich Burridge,
Claire Cloutier,
Diane Close,
Scott Deifik,
Christopher (``Topher'') Eliot,
Jeffrey Friedl,
Dr.@: Darrel Hankerson,
Michal Jaegermann,
Dr.@: Richard J.@: LeBlanc,
Michael Lijewski,
Pat Rankin,
Miriam Robbins,
Mary Sheehan,
and
Chuck Toporek.

@cindex Berry, Karl
@cindex Chassell, Robert J.@:
@c @cindex Texinfo
Robert J.@: Chassell provided much valuable advice on
the use of Texinfo.
He also deserves special thanks for
convincing me @emph{not} to title this @value{DOCUMENT}
@cite{How To Gawk Politely}.
Karl Berry helped significantly with the @TeX{} part of Texinfo.

@cindex Hartholz, Marshall
@cindex Hartholz, Elaine
@cindex Schreiber, Bert
@cindex Schreiber, Rita
I would like to thank Marshall and Elaine Hartholz of Seattle and
Dr.@: Bert and Rita Schreiber of Detroit for large amounts of quiet vacation
time in their homes, which allowed me to make significant progress on
this @value{DOCUMENT} and on @command{gawk} itself.

@cindex Hughes, Phil
Phil Hughes of SSC
contributed in a very important way by loaning me his laptop GNU/Linux
system, not once, but twice, which allowed me to do a lot of work while
away from home.

@cindex Trueman, David
David Trueman deserves special credit; he has done a yeoman job
of evolving @command{gawk} so that it performs well and without bugs.
Although he is no longer involved with @command{gawk},
working with him on this project was a significant pleasure.

@cindex Drepper, Ulrich
@cindex GNITS mailing list
@cindex mailing list, GNITS
The intrepid members of the GNITS mailing list, and most notably Ulrich
Drepper, provided invaluable help and feedback for the design of the
internationalization features.

@cindex Beebe, Nelson
@cindex Brown, Martin
@cindex Buening, Andreas
@cindex Deifik, Scott
@cindex Hankerson, Darrel
@cindex Hasegawa, Isamu
@cindex Jaegermann, Michal
@cindex Kahrs, J@"urgen
@cindex Rankin, Pat
@cindex Rommel, Kai Uwe
@cindex Zaretskii, Eli
Nelson Beebe,
Martin Brown,
Andreas Buening,
Scott Deifik,
Darrel Hankerson,
Isamu Hasegawa,
Michal Jaegermann,
J@"urgen Kahrs,
Pat Rankin,
Kai Uwe Rommel,
and Eli Zaretskii
(in alphabetical order)
make up the
@command{gawk} ``crack portability team.'' Without their hard work and
help, @command{gawk} would not be nearly the fine program it is today. It
has been and continues to be a pleasure working with this team of fine
people.

@cindex Kernighan, Brian
David and I would like to thank Brian Kernighan of Bell Laboratories for
invaluable assistance during the testing and debugging of @command{gawk}, and for
help in clarifying numerous points about the language. We could not have
done nearly as good a job on either @command{gawk} or its documentation without
his help.

Chuck Toporek, Mary Sheehan, and Claire Cloutier of O'Reilly & Associates contributed
significant editorial help for this @value{DOCUMENT} for the
3.1 release of @command{gawk}.

@cindex Robbins, Miriam
@cindex Robbins, Jean
@cindex Robbins, Harry
@cindex G-d
I must thank my wonderful wife, Miriam, for her patience through
the many versions of this project, for her proofreading,
and for sharing me with the computer.
I would like to thank my parents for their love, and for the grace with
which they raised and educated me.
Finally, I also must acknowledge my gratitude to G-d, for the many opportunities
He has sent my way, as well as for the gifts He has given me with which to
take advantage of those opportunities.
@sp 2
@noindent
Arnold Robbins @*
Nof Ayalon @*
ISRAEL @*
March, 2001

@ignore
@c Try this
@iftex
@page
@headings off
@majorheading I@ @ @ @ The @command{awk} Language and @command{gawk}
Part I describes the @command{awk} language and @command{gawk} program in detail.
It starts with the basics, and continues through all of the features of @command{awk}
and @command{gawk}. It contains the following chapters:

@itemize @bullet
@item
@ref{Getting Started}.

@item
@ref{Regexp}.

@item
@ref{Reading Files}.

@item
@ref{Printing}.

@item
@ref{Expressions}.

@item
@ref{Patterns and Actions}.

@item
@ref{Arrays}.

@item
@ref{Functions}.

@item
@ref{Internationalization}.

@item
@ref{Advanced Features}.

@item
@ref{Invoking Gawk}.
@end itemize

@page
@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
@oddheading @| @| @strong{@thischapter}@ @ @ @thispage
@end iftex
@end ignore

@node Getting Started
@chapter Getting Started with @command{awk}
@c @cindex script, definition of
@c @cindex rule, definition of
@c @cindex program, definition of
@c @cindex basic function of @command{awk}
@cindex @command{awk}, function of

The basic function of @command{awk} is to search files for lines (or other
units of text) that contain certain patterns. When a line matches one
of the patterns, @command{awk} performs specified actions on that line.
@command{awk} keeps processing input lines in this way until it reaches
the end of the input files.

@cindex @command{awk}, uses for
@c comma here is NOT for secondary
@cindex programming languages, data-driven vs. procedural
@cindex @command{awk} programs
Programs in @command{awk} are different from programs in most other languages,
because @command{awk} programs are @dfn{data-driven}; that is, you describe
the data you want to work with and then what to do when you find it.
Most other languages are @dfn{procedural}; you have to describe, in great
detail, every step the program is to take. When working with procedural
languages, it is usually much
harder to clearly describe the data your program will process.
For this reason, @command{awk} programs are often refreshingly easy to
read and write.

@cindex program, definition of
@cindex rule, definition of
When you run @command{awk}, you specify an @command{awk} @dfn{program} that
tells @command{awk} what to do. The program consists of a series of
@dfn{rules}. (It may also contain @dfn{function definitions},
an advanced feature that we will ignore for now.
@xref{User-defined}.) Each rule specifies one
pattern to search for and one action to perform
upon finding the pattern.

Syntactically, a rule consists of a pattern followed by an action. The
action is enclosed in curly braces to separate it from the pattern.
Newlines usually separate rules. Therefore, an @command{awk}
program looks like this:

@example
@var{pattern} @{ @var{action} @}
@var{pattern} @{ @var{action} @}
@dots{}
@end example

@menu
* Running gawk::                How to run @command{gawk} programs; includes
                                command-line syntax.
* Sample Data Files::           Sample data files for use in the @command{awk}
                                programs illustrated in this @value{DOCUMENT}.
* Very Simple::                 A very simple example.
* Two Rules::                   A less simple one-line example using two
                                rules.
* More Complex::                A more complex example.
* Statements/Lines::            Subdividing or combining statements into
                                lines.
* Other Features::              Other Features of @command{awk}.
* When::                        When to use @command{gawk} and when to use
                                other things.
@end menu

@node Running gawk
@section How to Run @command{awk} Programs

@cindex @command{awk} programs, running
There are several ways to run an @command{awk} program. If the program is
short, it is easiest to include it in the command that runs @command{awk},
like this:

@example
awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
@end example

@cindex command line, formats
When the program is long, it is usually more convenient to put it in a file
and run it with a command like this:

@example
awk -f @var{program-file} @var{input-file1} @var{input-file2} @dots{}
@end example

This @value{SECTION} discusses both mechanisms, along with several
variations of each.

@menu
* One-shot::                    Running a short throwaway @command{awk}
                                program.
* Read Terminal::               Using no input files (input from terminal
                                instead).
* Long::                        Putting permanent @command{awk} programs in
                                files.
* Executable Scripts::          Making self-contained @command{awk} programs.
* Comments::                    Adding documentation to @command{gawk}
                                programs.
* Quoting::                     More discussion of shell quoting issues.
@end menu

@node One-shot
@subsection One-Shot Throwaway @command{awk} Programs

Once you are familiar with @command{awk}, you will often type in simple
programs the moment you want to use them. Then you can write the
program as the first argument of the @command{awk} command, like this:

@example
awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
@end example

@noindent
where @var{program} consists of a series of @var{patterns} and
@var{actions}, as described earlier.
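
As a concrete instance of this command format, here is a minimal sketch
of a one-shot run. The scratch file (created with @command{mktemp}, which
is assumed to be available on your system) and its three lines of text are
made up for illustration. The program is a single rule whose pattern is
@samp{/line/} and whose action prints the matching record:

```shell
# Create a scratch input file (hypothetical contents, for illustration only).
input=$(mktemp)
printf 'first line\nsecond entry\nthird line\n' > "$input"

# One-shot program: the pattern /line/ selects records, the action prints them.
# The single quotes keep the shell from interpreting the braces and slashes.
result=$(awk '/line/ { print }' "$input")
echo "$result"

# Remove the scratch file.
rm -f "$input"
```

Only the first and third lines of the scratch file contain the string
@samp{line}, so those are the two records that the rule prints.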

@cindex single quote (@code{'})
@cindex @code{'} (single quote)
This command format instructs the @dfn{shell}, or command interpreter,
to start @command{awk} and use the @var{program} to process records in the
input file(s). There are single quotes around @var{program} so
the shell won't interpret any @command{awk} characters as special shell
characters. The quotes also cause the shell to treat all of @var{program} as
a single argument for @command{awk}, and allow @var{program} to be more
than one line long.

@cindex shells, scripts
@cindex @command{awk} programs, running, from shell scripts
This format is also useful for running short or medium-sized @command{awk}
programs from shell scripts, because it avoids the need for a separate
file for the @command{awk} program. A self-contained shell script is more
reliable because there are no other files to misplace.

@ref{Very Simple},
@ifnotinfo
later in this @value{CHAPTER},
@end ifnotinfo
presents several short,
self-contained programs.

@c Removed for gawk 3.1, doesn't really add anything here.
@ignore
As an interesting side point, the command

@example
awk '/foo/' @var{files} @dots{}
@end example

@noindent
is essentially the same as

@cindex @command{egrep} utility
@example
egrep foo @var{files} @dots{}
@end example
@end ignore

@node Read Terminal
@subsection Running @command{awk} Without Input Files

@cindex standard input
@cindex input, standard
@cindex input files, running @command{awk} without
You can also run @command{awk} without any input files. If you type the
following command line:

@example
awk '@var{program}'
@end example

@noindent
@command{awk} applies the @var{program} to the @dfn{standard input},
which usually means whatever you type on the terminal.
This continues
until you indicate end-of-file by typing @kbd{@value{CTL}-d}.
(On other operating systems, the end-of-file character may be different.
For example, on OS/2 and MS-DOS, it is @kbd{@value{CTL}-z}.)

@cindex files, input, See input files
@cindex input files, running @command{awk} without
@cindex @command{awk} programs, running, without input files
As an example, the following program prints a friendly piece of advice
(from Douglas Adams's @cite{The Hitchhiker's Guide to the Galaxy}),
to keep you from worrying about the complexities of computer programming
(@code{BEGIN} is a feature we haven't discussed yet):

@example
$ awk "BEGIN @{ print \"Don't Panic!\" @}"
@print{} Don't Panic!
@end example

@cindex quoting
@cindex double quote (@code{"})
@cindex @code{"} (double quote)
@cindex @code{\} (backslash)
@cindex backslash (@code{\})
This program does not read any input. The @samp{\} before each of the
inner double quotes is necessary because of the shell's quoting
rules---in particular because it mixes both single quotes and
double quotes.@footnote{Although we generally recommend the use of single
quotes around the program text, double quotes are needed here in order to
put the single quote into the message.}

This next simple @command{awk} program
emulates the @command{cat} utility; it copies whatever you type on the
keyboard to its standard output (why this works is explained shortly).

@example
$ awk '@{ print @}'
Now is the time for all good men
@print{} Now is the time for all good men
to come to the aid of their country.
@print{} to come to the aid of their country.
Four score and seven years ago, ...
@print{} Four score and seven years ago, ...
What, me worry?
@print{} What, me worry?
@kbd{@value{CTL}-d}
@end example

@node Long
@subsection Running Long Programs

@cindex @command{awk} programs, running
@cindex @command{awk} programs, lengthy
@cindex files, @command{awk} programs in
Sometimes your @command{awk} programs can be very long. In this case, it is
more convenient to put the program into a separate file. In order to tell
@command{awk} to use that file for its program, you type:

@example
awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{}
@end example

@cindex @code{-f} option
@cindex command line, options
@cindex options, command-line
The @option{-f} instructs the @command{awk} utility to get the @command{awk} program
from the file @var{source-file}. Any @value{FN} can be used for
@var{source-file}. For example, you could put the program:

@example
BEGIN @{ print "Don't Panic!" @}
@end example

@noindent
into the file @file{advice}. Then this command:

@example
awk -f advice
@end example

@noindent
does the same thing as this one:

@example
awk "BEGIN @{ print \"Don't Panic!\" @}"
@end example

@cindex quoting
@noindent
This was explained earlier
(@pxref{Read Terminal}).
Note that you don't usually need single quotes around the @value{FN} that you
specify with @option{-f}, because most @value{FN}s don't contain any of the shell's
special characters. Notice that in @file{advice}, the @command{awk}
program did not have single quotes around it. The quotes are only needed
for programs that are provided on the @command{awk} command line.

@c STARTOFRANGE sq1x
@cindex single quote (@code{'})
@c STARTOFRANGE qs2x
@cindex @code{'} (single quote)
If you want to identify your @command{awk} program files clearly as such,
you can add the extension @file{.awk} to the @value{FN}.
This doesn't
affect the execution of the @command{awk} program but it does make
``housekeeping'' easier.

@node Executable Scripts
@subsection Executable @command{awk} Programs
@cindex @command{awk} programs
@cindex @code{#} (number sign), @code{#!} (executable scripts)
@cindex number sign (@code{#}), @code{#!} (executable scripts)
@cindex Unix, @command{awk} scripts and
@cindex @code{#} (number sign), @code{#!} (executable scripts), portability issues with
@cindex number sign (@code{#}), @code{#!} (executable scripts), portability issues with

Once you have learned @command{awk}, you may want to write self-contained
@command{awk} scripts, using the @samp{#!} script mechanism. You can do
this on many Unix systems@footnote{The @samp{#!} mechanism works on
Linux systems,
systems derived from the 4.4-Lite Berkeley Software Distribution,
and most commercial Unix systems.} as well as on the GNU system.
For example, you could update the file @file{advice} to look like this:

@example
#! /bin/awk -f

BEGIN @{ print "Don't Panic!" @}
@end example

@noindent
After making this file executable (with the @command{chmod} utility),
simply type @samp{advice}
at the shell and the system arranges to run @command{awk}@footnote{The
line beginning with @samp{#!} lists the full @value{FN} of an interpreter
to run and an optional initial command-line argument to pass to that
interpreter. The operating system then runs the interpreter with the given
argument and the full argument list of the executed program. The first argument
in the list is the full @value{FN} of the @command{awk} program. The rest of the
argument list contains either options to @command{awk}, or @value{DF}s,
or both.} as if you had
typed @samp{awk -f advice}:

@example
$ chmod +x advice
$ advice
@print{} Don't Panic!
@end example

@noindent
(We assume you have the current directory in your shell's search
path variable (typically @code{$PATH}). If not, you may need
to type @samp{./advice} at the shell.)

Self-contained @command{awk} scripts are useful when you want to write a
program that users can invoke without their having to know that the program is
written in @command{awk}.

@c fakenode --- for prepinfo
@subheading Advanced Notes: Portability Issues with @samp{#!}
@cindex portability, @code{#!} (executable scripts)

Some systems limit the length of the interpreter name to 32 characters.
Often, this can be dealt with by using a symbolic link.

You should not put more than one argument on the @samp{#!}
line after the path to @command{awk}. It does not work. The operating system
treats the rest of the line as a single argument and passes it to @command{awk}.
Doing this leads to confusing behavior---most likely a usage diagnostic
of some sort from @command{awk}.

@cindex @code{ARGC}/@code{ARGV} variables, portability and
@cindex portability, @code{ARGV} variable
Finally,
the value of @code{ARGV[0]}
(@pxref{Built-in Variables})
varies depending upon your operating system.
Some systems put @samp{awk} there, some put the full pathname
of @command{awk} (such as @file{/bin/awk}), and some put the name
of your script (@samp{advice}). Don't rely on the value of @code{ARGV[0]}
to provide your script name.

@node Comments
@subsection Comments in @command{awk} Programs
@cindex @code{#} (number sign), commenting
@cindex number sign (@code{#}), commenting
@cindex commenting
@cindex @command{awk} programs, documenting

A @dfn{comment} is some text that is included in a program for the sake
of human readers; it is not really an executable part of the program.
Comments
can explain what the program does and how it works. Nearly all
programming languages have provisions for comments, as programs are
typically hard to understand without them.

In the @command{awk} language, a comment starts with the sharp sign
character (@samp{#}) and continues to the end of the line.
The @samp{#} does not have to be the first character on the line. The
@command{awk} language ignores the rest of a line following a sharp sign.
For example, we could have put the following into @file{advice}:

@example
# This program prints a nice friendly message. It helps
# keep novice users from being afraid of the computer.
BEGIN @{ print "Don't Panic!" @}
@end example

You can put comment lines into keyboard-composed throwaway @command{awk}
programs, but this usually isn't very useful; the purpose of a
comment is to help you or another person understand the program
when reading it at a later time.

@cindex quoting
@cindex single quote (@code{'}), vs. apostrophe
@cindex @code{'} (single quote), vs. apostrophe
@strong{Caution:} As mentioned in
@ref{One-shot},
you can enclose small to medium programs in single quotes, in order to keep
your shell scripts self-contained. When doing so, @emph{don't} put
an apostrophe (i.e., a single quote) into a comment (or anywhere else
in your program). The shell interprets the quote as the closing
quote for the entire program. As a result, usually the shell
prints a message about mismatched quotes, and if @command{awk} actually
runs, it will probably print strange messages about syntax errors.
For example, look at the following:

@example
$ awk '@{ print "hello" @} # let's be cute'
>
@end example

The shell sees that the first two quotes match, and that
a new quoted object begins at the end of the command line.
It therefore prompts with the secondary prompt, waiting for more input.
With Unix @command{awk}, closing the quoted string produces this result:

@example
$ awk '@{ print "hello" @} # let's be cute'
> '
@error{} awk: can't open file be
@error{} source line number 1
@end example

@cindex @code{\} (backslash)
@cindex backslash (@code{\})
Putting a backslash before the single quote in @samp{let's} wouldn't help,
since backslashes are not special inside single quotes.
The next @value{SUBSECTION} describes the shell's quoting rules.

@node Quoting
@subsection Shell-Quoting Issues
@cindex quoting, rules for

For short to medium length @command{awk} programs, it is most convenient
to enter the program on the @command{awk} command line.
This is best done by enclosing the entire program in single quotes.
This is true whether you are entering the program interactively at
the shell prompt, or writing it as part of a larger shell script:

@example
awk '@var{program text}' @var{input-file1} @var{input-file2} @dots{}
@end example

@cindex shells, quoting, rules for
@cindex Bourne shell, quoting rules for
Once you are working with the shell, it is helpful to have a basic
knowledge of shell quoting rules. The following rules apply only to
POSIX-compliant, Bourne-style shells (such as @command{bash}, the GNU Bourne-Again
Shell). If you use @command{csh}, you're on your own.

@itemize @bullet
@item
Quoted items can be concatenated with nonquoted items as well as with other
quoted items. The shell turns everything into one argument for
the command.

@item
Preceding any single character with a backslash (@samp{\}) quotes
that character. The shell removes the backslash and passes the quoted
character on to the command.

@item
@cindex @code{\} (backslash)
@cindex backslash (@code{\})
@cindex single quote (@code{'})
@cindex @code{'} (single quote)
Single quotes protect everything between the opening and closing quotes.
The shell does no interpretation of the quoted text, passing it on verbatim
to the command.
It is @emph{impossible} to embed a single quote inside single-quoted text.
Refer back to
@ref{Comments},
for an example of what happens if you try.

@item
@cindex double quote (@code{"})
@cindex @code{"} (double quote)
Double quotes protect most things between the opening and closing quotes.
The shell does at least variable and command substitution on the quoted text.
Different shells may do additional kinds of processing on double-quoted text.

Since certain characters within double-quoted text are processed by the shell,
they must be @dfn{escaped} within the text. Of note are the characters
@samp{$}, @samp{`}, @samp{\}, and @samp{"}, all of which must be preceded by
a backslash within double-quoted text if they are to be passed on literally
to the program. (The leading backslash is stripped first.)
Thus, the example seen
@ifnotinfo
previously
@end ifnotinfo
in @ref{Read Terminal},
is applicable:

@example
$ awk "BEGIN @{ print \"Don't Panic!\" @}"
@print{} Don't Panic!
@end example

@cindex single quote (@code{'}), with double quotes
@cindex @code{'} (single quote), with double quotes
Note that the single quote is not special within double quotes.

@item
Null strings are removed when they occur as part of a non-null
command-line argument, while explicit null objects are kept.
For example, to specify that the field separator @code{FS} should
be set to the null string, use:

@example
awk -F "" '@var{program}' @var{files} # correct
@end example

@noindent
@cindex null strings, quoting and
Don't use this:

@example
awk -F"" '@var{program}' @var{files} # wrong!
@end example

@noindent
In the second case, @command{awk} will attempt to use the text of the program
as the value of @code{FS}, and the first @value{FN} as the text of the program!
This results in syntax errors at best, and confusing behavior at worst.
@end itemize

@cindex quoting, tricks for
Mixing single and double quotes is difficult. You have to resort
to shell quoting tricks, like this:

@example
$ awk 'BEGIN @{ print "Here is a single quote <'"'"'>" @}'
@print{} Here is a single quote <'>
@end example

@noindent
This program consists of three concatenated quoted strings. The first and the
third are single-quoted, the second is double-quoted.

This can be ``simplified'' to:

@example
$ awk 'BEGIN @{ print "Here is a single quote <'\''>" @}'
@print{} Here is a single quote <'>
@end example

@noindent
Judge for yourself which of these two is the more readable.

Another option is to use double quotes, escaping the embedded, @command{awk}-level
double quotes:

@example
$ awk "BEGIN @{ print \"Here is a single quote <'>\" @}"
@print{} Here is a single quote <'>
@end example

@noindent
@c ENDOFRANGE sq1x
@c ENDOFRANGE qs2x
This option is also painful, because double quotes, backslashes, and dollar signs
are very common in @command{awk} programs.
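
One more approach, sketched here on the assumption that your @command{awk}
accepts the POSIX @option{-v} option for assigning a variable before the
program starts, is to pass the single quote in as the value of a variable.
The variable name @code{sq} and the shell function name @code{show_quote}
are arbitrary choices made for this illustration:

```shell
# show_quote is a hypothetical helper name used only in this sketch.
show_quote() {
    # The shell passes a double-quoted single quote as the value of sq;
    # the awk program itself stays entirely inside single quotes.
    awk -v sq="'" 'BEGIN { print "Here is a single quote <" sq ">" }'
}

show_quote
```

Because the quote character never appears inside the single-quoted program
text, no concatenation of quoted strings is needed there.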

If you really need both single and double quotes in your @command{awk}
program, it is probably best to move it into a separate file, where
the shell won't be part of the picture, and you can say what you mean.

@node Sample Data Files
@section @value{DDF}s for the Examples
@c For gawk >= 3.2, update these data files. No-one has such slow modems!

@cindex input files, examples
@cindex @code{BBS-list} file
Many of the examples in this @value{DOCUMENT} take their input from two sample
@value{DF}s. The first, @file{BBS-list}, represents a list of
computer bulletin board systems together with information about those systems.
The second @value{DF}, called @file{inventory-shipped}, contains
information about monthly shipments. In both files,
each line is considered to be one @dfn{record}.

In the @value{DF} @file{BBS-list}, each record contains the name of a computer
bulletin board, its phone number, the board's baud rate(s), and a code for
the number of hours it is operational. An @samp{A} in the last column
means the board operates 24 hours a day. A @samp{B} in the last
column means the board only operates on evening and weekend hours.
A @samp{C} means the board operates only on weekends:

@c 2e: Update the baud rates to reflect today's faster modems
@example
@c system if test ! -d eg ; then mkdir eg ; fi
@c system if test ! -d eg/lib ; then mkdir eg/lib ; fi
@c system if test ! -d eg/data ; then mkdir eg/data ; fi
@c system if test ! -d eg/prog ; then mkdir eg/prog ; fi
@c system if test ! -d eg/misc ; then mkdir eg/misc ; fi
@c file eg/data/BBS-list
aardvark 555-5553 1200/300 B
alpo-net 555-3412 2400/1200/300 A
barfly 555-7685 1200/300 A
bites 555-1675 2400/1200/300 A
camelot 555-0542 300 C
core 555-2912 1200/300 C
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
macfoo 555-6480 1200/300 A
sdace 555-3430 2400/1200/300 A
sabafoo 555-2127 1200/300 C
@c endfile
@end example

@cindex @code{inventory-shipped} file
The @value{DF} @file{inventory-shipped} represents
information about shipments during the year.
Each record contains the month, the number
of green crates shipped, the number of red boxes shipped, the number of
orange bags shipped, and the number of blue packages shipped,
respectively. There are 16 entries, covering the 12 months of last year
and the first four months of the current year.

@example
@c file eg/data/inventory-shipped
Jan 13 25 15 115
Feb 15 32 24 226
Mar 15 24 34 228
Apr 31 52 63 420
May 16 34 29 208
Jun 31 42 75 492
Jul 24 34 67 436
Aug 15 34 47 316
Sep 13 55 37 277
Oct 29 54 68 525
Nov 20 87 82 577
Dec 17 35 61 401

Jan 21 36 64 620
Feb 26 58 80 652
Mar 24 75 70 495
Apr 21 70 74 514
@c endfile
@end example

@ifinfo
If you are reading this in GNU Emacs using Info, you can copy the regions
of text showing these sample files into your own test files. This way you
can try out the examples shown in the remainder of this document. You do
this by using the command @kbd{M-x write-region} to copy text from the Info
file into a file for use with @command{awk}
(@pxref{Misc File Ops, , Miscellaneous File Operations, emacs, GNU Emacs Manual},
for more information). Using this information, create your own
@file{BBS-list} and @file{inventory-shipped} files and practice what you
learn in this @value{DOCUMENT}.

@cindex Texinfo
If you are using the stand-alone version of Info,
see @ref{Extract Program},
for an @command{awk} program that extracts these @value{DF}s from
@file{gawk.texi}, the Texinfo source file for this Info file.
@end ifinfo

@node Very Simple
@section Some Simple Examples

The following command runs a simple @command{awk} program that searches the
input file @file{BBS-list} for the character string @samp{foo} (a
grouping of characters is usually called a @dfn{string};
the term @dfn{string} is based on similar usage in English, such
as ``a string of pearls,'' or ``a string of cars in a train''):

@example
awk '/foo/ @{ print $0 @}' BBS-list
@end example

@noindent
When lines containing @samp{foo} are found, they are printed because
@w{@samp{print $0}} means print the current line. (Just @samp{print} by
itself means the same thing, so we could have written that
instead.)

You will notice that slashes (@samp{/}) surround the string @samp{foo}
in the @command{awk} program. The slashes indicate that @samp{foo}
is the pattern to search for. This type of pattern is called a
@dfn{regular expression}, which is covered in more detail later
(@pxref{Regexp}).
The pattern is allowed to match parts of words.
There are
single quotes around the @command{awk} program so that the shell won't
interpret any of it as special shell characters.

Here is what this program prints:

@example
$ awk '/foo/ @{ print $0 @}' BBS-list
@print{} fooey 555-1234 2400/1200/300 B
@print{} foot 555-6699 1200/300 B
@print{} macfoo 555-6480 1200/300 A
@print{} sabafoo 555-2127 1200/300 C
@end example

@cindex actions, default
@cindex patterns, default
In an @command{awk} rule, either the pattern or the action can be omitted,
but not both.
If the pattern is omitted, then the action is performed
for @emph{every} input line. If the action is omitted, the default
action is to print all lines that match the pattern.

@cindex actions, empty
Thus, we could leave out the action (the @code{print} statement and the curly
braces) in the previous example and the result would be the same: all
lines matching the pattern @samp{foo} are printed. By comparison,
omitting the @code{print} statement but retaining the curly braces makes an
empty action that does nothing (i.e., no lines are printed).

@cindex @command{awk} programs, one-line examples
Many practical @command{awk} programs are just a line or two. Following is a
collection of useful, short programs to get you started. Some of these
programs contain constructs that haven't been covered yet. (The description
of the program will give you a good idea of what is going on, but please
read the rest of the @value{DOCUMENT} to become an @command{awk} expert!)
Most of the examples use a @value{DF} named @file{data}. This is just a
placeholder; if you use these programs yourself, substitute
your own @value{FN}s for @file{data}.
For future reference, note that there is often more than
one way to do things in @command{awk}. At some point, you may want
to look back at these examples and see if
you can come up with different ways to do the same things shown here:

@itemize @bullet
@item
Print the length of the longest input line:

@example
awk '@{ if (length($0) > max) max = length($0) @}
     END @{ print max @}' data
@end example

@item
Print every line that is longer than 80 characters:

@example
awk 'length($0) > 80' data
@end example

The sole rule has a relational expression as its pattern and it has no
action---so the default action, printing the record, is used.

@cindex @command{expand} utility
@item
Print the length of the longest line in @file{data}:

@example
expand data | awk '@{ if (x < length()) x = length() @}
              END @{ print "maximum line length is " x @}'
@end example

The input is processed by the @command{expand} utility to change tabs
into spaces, so the widths compared are actually the right-margin columns.

@item
Print every line that has at least one field:

@example
awk 'NF > 0' data
@end example

This is an easy way to delete blank lines from a file (or rather, to
create a new file similar to the old file but from which the blank lines
have been removed).

@item
Print seven random numbers from 0 to 100, inclusive:

@example
awk 'BEGIN @{ for (i = 1; i <= 7; i++)
                 print int(101 * rand()) @}'
@end example

@item
Print the total number of bytes used by @var{files}:

@example
ls -l @var{files} | awk '@{ x += $5 @}
                  END @{ print "total bytes: " x @}'
@end example

@item
Print the total number of kilobytes used by @var{files}:

@c Don't use \ continuation, not discussed yet
@example
ls -l @var{files} | awk '@{ x += $5 @}
   END @{ print "total K-bytes: " (x + 1023)/1024 @}'
@end example

@item
Print a sorted list of the login names of all users:

@example
awk -F: '@{ print $1 @}' /etc/passwd | sort
@end example

@item
Count the lines in a file:

@example
awk 'END @{ print NR @}' data
@end example

@item
Print the even-numbered lines in the @value{DF}:

@example
awk 'NR % 2 == 0' data
@end example

If you use the expression @samp{NR % 2 == 1} instead,
the program prints the odd-numbered lines.
@end itemize

@node Two Rules
@section An Example with Two Rules
@cindex @command{awk} programs

The @command{awk} utility reads the input files one line at a
time. For each line, @command{awk} tries the patterns of each of the rules.
If several patterns match, then several actions are run in the order in
which they appear in the @command{awk} program. If no patterns match, then
no actions are run.

After processing all the rules that match the line (and perhaps there are none),
@command{awk} reads the next line. (However,
@pxref{Next Statement},
and also @pxref{Nextfile Statement}).
This continues until the program reaches the end of the file.
For example, the following @command{awk} program contains two rules:

@example
/12/ @{ print $0 @}
/21/ @{ print $0 @}
@end example

@noindent
The first rule has the string @samp{12} as the
pattern and @samp{print $0} as the action. The second rule has the
string @samp{21} as the pattern and also has @samp{print $0} as the
action. Each rule's action is enclosed in its own pair of braces.

This program prints every line that contains the string
@samp{12} @emph{or} the string @samp{21}. If a line contains both
strings, it is printed twice, once by each rule.
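
To see the double printing in isolation, the same two rules can be run on a
few made-up input lines. This is only a sketch: the input text and the shell
function name @code{two_rules_demo} are invented for the illustration.

```shell
# two_rules_demo is a hypothetical name; the four input lines are invented.
two_rules_demo() {
    # printf writes one input line per argument; awk then applies both rules.
    printf '%s\n' 'only 12 here' 'only 21 here' 'both 12 and 21' 'neither' |
    awk '/12/ { print $0 }
         /21/ { print $0 }'
}

two_rules_demo
```

The line containing both strings should appear twice in the output, the two
lines that match a single pattern once each, and the line that matches
neither pattern not at all.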

This is what happens if we run this program on our two sample @value{DF}s,
@file{BBS-list} and @file{inventory-shipped}:

@example
$ awk '/12/ @{ print $0 @}
>      /21/ @{ print $0 @}' BBS-list inventory-shipped
@print{} aardvark 555-5553 1200/300 B
@print{} alpo-net 555-3412 2400/1200/300 A
@print{} barfly 555-7685 1200/300 A
@print{} bites 555-1675 2400/1200/300 A
@print{} core 555-2912 1200/300 C
@print{} fooey 555-1234 2400/1200/300 B
@print{} foot 555-6699 1200/300 B
@print{} macfoo 555-6480 1200/300 A
@print{} sdace 555-3430 2400/1200/300 A
@print{} sabafoo 555-2127 1200/300 C
@print{} sabafoo 555-2127 1200/300 C
@print{} Jan 21 36 64 620
@print{} Apr 21 70 74 514
@end example

@noindent
Note how the line beginning with @samp{sabafoo}
in @file{BBS-list} was printed twice, once for each rule.

@node More Complex
@section A More Complex Example

Now that we've mastered some simple tasks, let's look at
what typical @command{awk}
programs do. This example shows how @command{awk} can be used to
summarize, select, and rearrange the output of another utility. It uses
features that haven't been covered yet, so don't worry if you don't
understand all the details:

@example
ls -l | awk '$6 == "Nov" @{ sum += $5 @}
             END @{ print sum @}'
@end example

@cindex @command{csh} utility, backslash continuation and
@cindex @command{ls} utility
@cindex backslash (@code{\}), continuing lines and, in @command{csh}
@cindex @code{\} (backslash), continuing lines and, in @command{csh}
This command prints the total number of bytes in all the files in the
current directory that were last modified in November (of any year).
@footnote{In the C shell (@command{csh}), you need to type
a semicolon and then a backslash at the end of the first line; see
@ref{Statements/Lines}, for an
explanation. In a POSIX-compliant shell, such as the Bourne
shell or @command{bash}, you can type the example as shown. If the command
@samp{echo $path} produces an empty output line, you are most likely
using a POSIX-compliant shell. Otherwise, you are probably using the
C shell or a shell derived from it.}
The @w{@samp{ls -l}} part of this example is a system command that gives
you a listing of the files in a directory, including each file's size and the date
the file was last modified. Its output looks like this:

@example
-rw-r--r-- 1 arnold user 1933 Nov 7 13:05 Makefile
-rw-r--r-- 1 arnold user 10809 Nov 7 13:03 awk.h
-rw-r--r-- 1 arnold user 983 Apr 13 12:14 awk.tab.h
-rw-r--r-- 1 arnold user 31869 Jun 15 12:20 awk.y
-rw-r--r-- 1 arnold user 22414 Nov 7 13:03 awk1.c
-rw-r--r-- 1 arnold user 37455 Nov 7 13:03 awk2.c
-rw-r--r-- 1 arnold user 27511 Dec 9 13:07 awk3.c
-rw-r--r-- 1 arnold user 7989 Nov 7 13:03 awk4.c
@end example

@noindent
@cindex line continuations, with C shell
The first field contains read-write permissions, the second field contains
the number of links to the file, and the third field identifies the owner of
the file. The fourth field identifies the group of the file.
The fifth field contains the size of the file in bytes. The
sixth, seventh, and eighth fields contain the month, day, and time,
respectively, that the file was last modified.
Finally, the ninth field
contains the name of the file.@footnote{On some
very old systems, you may need to use @samp{ls -lg} to get this output.}

@c @cindex automatic initialization
@cindex initialization, automatic
The @samp{$6 == "Nov"} in our @command{awk} program is an expression that
tests whether the sixth field of the output from @w{@samp{ls -l}}
matches the string @samp{Nov}. Each time a line has the string
@samp{Nov} for its sixth field, the action @samp{sum += $5} is
performed. This adds the fifth field (the file's size) to the variable
@code{sum}. As a result, when @command{awk} has finished reading all the
input lines, @code{sum} is the total of the sizes of the files whose
lines matched the pattern. (This works because @command{awk} variables
are automatically initialized to zero.)

After the last line of output from @command{ls} has been processed, the
@code{END} rule executes and prints the value of @code{sum}.
In this example, the value of @code{sum} is 80600.

These more advanced @command{awk} techniques are covered in later sections
(@pxref{Action Overview}). Before you can move on to more
advanced @command{awk} programming, you have to know how @command{awk} interprets
your input and displays your output. By manipulating fields and using
@code{print} statements, you can produce some very useful and
impressive-looking reports.
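
The total of 80600 can be checked without depending on the contents of a
real directory. The following sketch feeds the sample @samp{ls -l} listing
from this @value{SECTION} to the same @command{awk} program through a shell
here-document; the function name @code{november_total} is made up for the
illustration:

```shell
# november_total is a hypothetical helper; the listing is the sample
# ls -l output from this section, supplied through a here-document.
november_total() {
    awk '$6 == "Nov" { sum += $5 }
         END { print sum }' <<'EOF'
-rw-r--r-- 1 arnold user 1933 Nov 7 13:05 Makefile
-rw-r--r-- 1 arnold user 10809 Nov 7 13:03 awk.h
-rw-r--r-- 1 arnold user 983 Apr 13 12:14 awk.tab.h
-rw-r--r-- 1 arnold user 31869 Jun 15 12:20 awk.y
-rw-r--r-- 1 arnold user 22414 Nov 7 13:03 awk1.c
-rw-r--r-- 1 arnold user 37455 Nov 7 13:03 awk2.c
-rw-r--r-- 1 arnold user 27511 Dec 9 13:07 awk3.c
-rw-r--r-- 1 arnold user 7989 Nov 7 13:03 awk4.c
EOF
}

november_total
```

The five November entries (1933, 10809, 22414, 37455, and 7989 bytes) add
up to 80600; the April, June, and December entries are skipped because
their sixth field does not equal @samp{Nov}.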

@node Statements/Lines
@section @command{awk} Statements Versus Lines
@cindex line breaks
@cindex newlines

Most often, each line in an @command{awk} program is a separate statement or
separate rule, like this:

@example
awk '/12/ @{ print $0 @}
     /21/ @{ print $0 @}' BBS-list inventory-shipped
@end example

@cindex @command{gawk}, newlines in
However, @command{gawk} ignores newlines after any of the following
symbols and keywords:

@example
,    @{    ?    :    ||    &&    do    else
@end example

@noindent
A newline at any other point is considered the end of the
statement.@footnote{The @samp{?} and @samp{:} referred to here are those of the
three-operand conditional expression described in
@ref{Conditional Exp}.
Splitting lines after @samp{?} and @samp{:} is a minor @command{gawk}
extension; if @option{--posix} is specified
(@pxref{Options}), then this extension is disabled.}

@cindex @code{\} (backslash), continuing lines and
@cindex backslash (@code{\}), continuing lines and
If you would like to split a single statement into two lines at a point
where a newline would terminate it, you can @dfn{continue} it by ending the
first line with a backslash character (@samp{\}). The backslash must be
the final character on the line in order to be recognized as a continuation
character. A backslash is allowed anywhere in the statement, even
in the middle of a string or regular expression. For example:

@example
awk '/This regular expression is too long, so continue it\
 on the next line/ @{ print $1 @}'
@end example

@noindent
@cindex portability, backslash continuation and
We have generally not used backslash continuation in the sample programs
in this @value{DOCUMENT}.
In @command{gawk}, there is no limit on the
length of a line, so backslash continuation is never strictly necessary;
it just makes programs more readable. For this same reason, as well as
for clarity, we have kept most statements short in the sample programs
presented throughout the @value{DOCUMENT}. Backslash continuation is
most useful when your @command{awk} program is in a separate source file
instead of entered from the command line. You should also note that
many @command{awk} implementations are more particular about where you
may use backslash continuation. For example, they may not allow you to
split a string constant using backslash continuation. Thus, for maximum
portability of your @command{awk} programs, it is best not to split your
lines in the middle of a regular expression or a string.
@c 10/2000: gawk, mawk, and current bell labs awk allow it,
@c solaris 2.7 nawk does not. Solaris /usr/xpg4/bin/awk does though! sigh.

@cindex @command{csh} utility
@cindex backslash (@code{\}), continuing lines and, in @command{csh}
@cindex @code{\} (backslash), continuing lines and, in @command{csh}
@strong{Caution:} @emph{Backslash continuation does not work as described
with the C shell.} It works for @command{awk} programs in files and
for one-shot programs, @emph{provided} you are using a POSIX-compliant
shell, such as the Unix Bourne shell or @command{bash}. But the C shell behaves
differently! There, you must use two backslashes in a row, followed by
a newline. Note also that when using the C shell, @emph{every} newline
in your @command{awk} program must be escaped with a backslash. To illustrate:

@example
% awk 'BEGIN @{ \
?   print \\
?       "hello, world" \
? @}'
@print{} hello, world
@end example

@noindent
Here, the @samp{%} and @samp{?} are the C shell's primary and secondary
prompts, analogous to the standard shell's @samp{$} and @samp{>}.

Compare the previous example to how it is done with a POSIX-compliant shell:

@example
$ awk 'BEGIN @{
>   print \
>       "hello, world"
> @}'
@print{} hello, world
@end example

@command{awk} is a line-oriented language. Each rule's action has to
begin on the same line as the pattern. To have the pattern and action
on separate lines, you @emph{must} use backslash continuation; there
is no other option.

@cindex backslash (@code{\}), continuing lines and, comments and
@cindex @code{\} (backslash), continuing lines and, comments and
@cindex commenting, backslash continuation and
Another thing to keep in mind is that backslash continuation and
comments do not mix. As soon as @command{awk} sees the @samp{#} that
starts a comment, it ignores @emph{everything} on the rest of the
line. For example:

@example
$ gawk 'BEGIN @{ print "dont panic" # a friendly \
>                                    BEGIN rule
> @}'
@error{} gawk: cmd. line:2:                BEGIN rule
@error{} gawk: cmd. line:2:                ^ parse error
@end example

@noindent
In this case, it looks like the backslash would continue the comment onto the
next line. However, the backslash-newline combination is never even
noticed because it is ``hidden'' inside the comment. Thus, the
@code{BEGIN} is noted as a syntax error.

@cindex statements, multiple
@cindex @code{;} (semicolon)
@cindex semicolon (@code{;})
When @command{awk} statements within one rule are short, you might want to put
more than one of them on a line. This is accomplished by separating the statements
with a semicolon (@samp{;}).
This also applies to the rules themselves.
Thus, the program shown at the start of this @value{SECTION}
could also be written this way:

@example
/12/ @{ print $0 @} ; /21/ @{ print $0 @}
@end example

@noindent
@strong{Note:} The requirement that rules on the same line must be
separated with a semicolon was not in the original @command{awk}
language; it was added for consistency with the treatment of statements
within an action.

@node Other Features
@section Other Features of @command{awk}

@cindex variables
The @command{awk} language provides a number of predefined, or
@dfn{built-in}, variables that your programs can use to get information
from @command{awk}. There are other variables your program can set
as well to control how @command{awk} processes your data.

In addition, @command{awk} provides a number of built-in functions for doing
common computational and string-related operations.
@command{gawk} provides built-in functions for working with timestamps,
performing bit manipulation, and translating strings at runtime.

As we develop our presentation of the @command{awk} language, we introduce
most of the variables and many of the functions. They are defined
systematically in @ref{Built-in Variables}, and
@ref{Built-in}.

@node When
@section When to Use @command{awk}

@cindex @command{awk}, uses for
Now that you've seen some of what @command{awk} can do,
you might wonder how @command{awk} could be useful for you. By using
utility programs, advanced patterns, field separators, arithmetic
statements, and other selection criteria, you can produce much more
complex output. The @command{awk} language is very useful for producing
reports from large amounts of raw data, such as summarizing information
from the output of other utility programs like @command{ls}.
(@xref{More Complex}.)

Programs written with @command{awk} are usually much smaller than they would
be in other languages. This makes @command{awk} programs easy to compose and
use. Often, @command{awk} programs can be quickly composed at your terminal,
used once, and thrown away. Because @command{awk} programs are interpreted, you
can avoid the (usually lengthy) compilation part of the typical
edit-compile-test-debug cycle of software development.

Complex programs have been written in @command{awk}, including a complete
retargetable assembler for eight-bit microprocessors (@pxref{Glossary}, for
more information), and a microcode assembler for a special-purpose Prolog
computer. However, @command{awk}'s capabilities are strained by tasks of
such complexity.

@cindex @command{awk} programs, complex
If you find yourself writing @command{awk} scripts of more than, say, a few
hundred lines, you might consider using a different programming
language. Emacs Lisp is a good choice if you need sophisticated string
or pattern matching capabilities. The shell is also good at string and
pattern matching; in addition, it allows powerful use of the system
utilities. More conventional languages, such as C, C++, and Java, offer
better facilities for system programming and for managing the complexity
of large programs. Programs in these languages may require more lines
of source code than the equivalent @command{awk} programs, but they are
easier to maintain and usually run more efficiently.

@node Regexp
@chapter Regular Expressions
@cindex regexp, See regular expressions
@c STARTOFRANGE regexp
@cindex regular expressions

A @dfn{regular expression}, or @dfn{regexp}, is a way of describing a
set of strings.
Because regular expressions are such a fundamental part of @command{awk}
programming, their format and use deserve a separate @value{CHAPTER}.

@cindex forward slash (@code{/})
@cindex @code{/} (forward slash)
A regular expression enclosed in slashes (@samp{/})
is an @command{awk} pattern that matches every input record whose text
belongs to that set.
The simplest regular expression is a sequence of letters, numbers, or
both. Such a regexp matches any string that contains that sequence.
Thus, the regexp @samp{foo} matches any string containing @samp{foo}.
Therefore, the pattern @code{/foo/} matches any input record containing
the three characters @samp{foo} @emph{anywhere} in the record. Other
kinds of regexps let you specify more complicated classes of strings.

@ifnotinfo
Initially, the examples in this @value{CHAPTER} are simple.
As we explain more about how
regular expressions work, we will present more complicated instances.
@end ifnotinfo

@menu
* Regexp Usage::                How to Use Regular Expressions.
* Escape Sequences::            How to write nonprinting characters.
* Regexp Operators::            Regular Expression Operators.
* Character Lists::             What can go between @samp{[...]}.
* GNU Regexp Operators::        Operators specific to GNU software.
* Case-sensitivity::            How to do case-insensitive matching.
* Leftmost Longest::            How much text matches.
* Computed Regexps::            Using Dynamic Regexps.
* Locales::                     How the locale affects things.
@end menu

@node Regexp Usage
@section How to Use Regular Expressions

@cindex regular expressions, as patterns
A regular expression can be used as a pattern by enclosing it in
slashes. Then the regular expression is tested against the
entire text of each record. (Normally, it only needs
to match some part of the text in order to succeed.)
For example, the
following prints the second field of each record that contains the string
@samp{foo} anywhere in it:

@example
$ awk '/foo/ @{ print $2 @}' BBS-list
@print{} 555-1234
@print{} 555-6699
@print{} 555-6480
@print{} 555-2127
@end example

@cindex regular expressions, operators
@cindex operators, string-matching
@c @cindex operators, @code{~}
@cindex string-matching operators
@cindex @code{~} (tilde), @code{~} operator
@cindex tilde (@code{~}), @code{~} operator
@cindex @code{!} (exclamation point), @code{!~} operator
@cindex exclamation point (@code{!}), @code{!~} operator
@c @cindex operators, @code{!~}
@cindex @code{if} statement
@cindex @code{while} statement
@cindex @code{do}-@code{while} statement
@c @cindex statements, @code{if}
@c @cindex statements, @code{while}
@c @cindex statements, @code{do}
Regular expressions can also be used in matching expressions. These
expressions allow you to specify the string to match against; it need
not be the entire current input record. The two operators @samp{~}
and @samp{!~} perform regular expression comparisons. Expressions
using these operators can be used as patterns, or in @code{if},
@code{while}, @code{for}, and @code{do} statements.
(@xref{Statements}.)
For example:

@example
@var{exp} ~ /@var{regexp}/
@end example

@noindent
is true if the expression @var{exp} (taken as a string)
matches @var{regexp}.
The following example matches, or selects,
all input records with the uppercase letter @samp{J} somewhere in the
first field:

@example
$ awk '$1 ~ /J/' inventory-shipped
@print{} Jan  13  25  15 115
@print{} Jun  31  42  75 492
@print{} Jul  24  34  67 436
@print{} Jan  21  36  64 620
@end example

So does this:

@example
awk '@{ if ($1 ~ /J/) print @}' inventory-shipped
@end example

This next expression is true if the expression @var{exp}
(taken as a character string)
does @emph{not} match @var{regexp}:

@example
@var{exp} !~ /@var{regexp}/
@end example

The following example matches,
or selects, all input records whose first field @emph{does not} contain
the uppercase letter @samp{J}:

@example
$ awk '$1 !~ /J/' inventory-shipped
@print{} Feb  15  32  24 226
@print{} Mar  15  24  34 228
@print{} Apr  31  52  63 420
@print{} May  16  34  29 208
@dots{}
@end example

@cindex regexp constants
@cindex regular expressions, constants, See regexp constants
When a regexp is enclosed in slashes, such as @code{/foo/}, we call it
a @dfn{regexp constant}, much like @code{5.27} is a numeric constant and
@code{"foo"} is a string constant.

@node Escape Sequences
@section Escape Sequences

@cindex escape sequences
@cindex backslash (@code{\}), in escape sequences
@cindex @code{\} (backslash), in escape sequences
Some characters cannot be included literally in string constants
(@code{"foo"}) or regexp constants (@code{/foo/}).
Instead, they should be represented with @dfn{escape sequences},
which are character sequences beginning with a backslash (@samp{\}).
One use of an escape sequence is to include a double-quote character in
a string constant.
Because a plain double quote ends the string, you
must use @samp{\"} to represent an actual double-quote character as a
part of the string. For example:

@example
$ awk 'BEGIN @{ print "He said \"hi!\" to her." @}'
@print{} He said "hi!" to her.
@end example

The backslash character itself is another character that cannot be
included normally; you must write @samp{\\} to put one backslash in the
string or regexp. Thus, the string whose contents are the two characters
@samp{"} and @samp{\} must be written @code{"\"\\"}.

Backslash also represents unprintable characters
such as TAB or newline. While there is nothing to stop you from entering most
unprintable characters directly in a string constant or regexp constant,
they may look ugly.

The following table lists
all the escape sequences used in @command{awk} and
what they represent. Unless noted otherwise, all these escape
sequences apply to both string constants and regexp constants:

@table @code
@item \\
A literal backslash, @samp{\}.

@c @cindex @command{awk} language, V.4 version
@cindex @code{\} (backslash), @code{\a} escape sequence
@cindex backslash (@code{\}), @code{\a} escape sequence
@item \a
The ``alert'' character, @kbd{@value{CTL}-g}, ASCII code 7 (BEL).
(This usually makes some sort of audible noise.)

@cindex @code{\} (backslash), @code{\b} escape sequence
@cindex backslash (@code{\}), @code{\b} escape sequence
@item \b
Backspace, @kbd{@value{CTL}-h}, ASCII code 8 (BS).

@cindex @code{\} (backslash), @code{\f} escape sequence
@cindex backslash (@code{\}), @code{\f} escape sequence
@item \f
Formfeed, @kbd{@value{CTL}-l}, ASCII code 12 (FF).

@cindex @code{\} (backslash), @code{\n} escape sequence
@cindex backslash (@code{\}), @code{\n} escape sequence
@item \n
Newline, @kbd{@value{CTL}-j}, ASCII code 10 (LF).

@cindex @code{\} (backslash), @code{\r} escape sequence
@cindex backslash (@code{\}), @code{\r} escape sequence
@item \r
Carriage return, @kbd{@value{CTL}-m}, ASCII code 13 (CR).

@cindex @code{\} (backslash), @code{\t} escape sequence
@cindex backslash (@code{\}), @code{\t} escape sequence
@item \t
Horizontal TAB, @kbd{@value{CTL}-i}, ASCII code 9 (HT).

@c @cindex @command{awk} language, V.4 version
@cindex @code{\} (backslash), @code{\v} escape sequence
@cindex backslash (@code{\}), @code{\v} escape sequence
@item \v
Vertical tab, @kbd{@value{CTL}-k}, ASCII code 11 (VT).

@cindex @code{\} (backslash), @code{\}@var{nnn} escape sequence
@cindex backslash (@code{\}), @code{\}@var{nnn} escape sequence
@item \@var{nnn}
The octal value @var{nnn}, where @var{nnn} stands for 1 to 3 digits
between @samp{0} and @samp{7}. For example, the code for the ASCII ESC
(escape) character is @samp{\033}.

@c @cindex @command{awk} language, V.4 version
@c @cindex @command{awk} language, POSIX version
@cindex @code{\} (backslash), @code{\x} escape sequence
@cindex backslash (@code{\}), @code{\x} escape sequence
@item \x@var{hh}@dots{}
The hexadecimal value @var{hh}, where @var{hh} stands for a sequence
of hexadecimal digits (@samp{0}--@samp{9}, and either @samp{A}--@samp{F}
or @samp{a}--@samp{f}). Like the same construct
in ISO C, the escape sequence continues until the first nonhexadecimal
digit is seen. However, using more than two hexadecimal digits produces
undefined results. (The @samp{\x} escape sequence is not allowed in
POSIX @command{awk}.)

@cindex @code{\} (backslash), @code{\/} escape sequence
@cindex backslash (@code{\}), @code{\/} escape sequence
@item \/
A literal slash (necessary for regexp constants only).
This expression is used when you want to write a regexp
constant that contains a slash.
Because the regexp is delimited by
slashes, you need to escape the slash that is part of the pattern,
in order to tell @command{awk} to keep processing the rest of the regexp.

@cindex @code{\} (backslash), @code{\"} escape sequence
@cindex backslash (@code{\}), @code{\"} escape sequence
@item \"
A literal double quote (necessary for string constants only).
This expression is used when you want to write a string
constant that contains a double quote. Because the string is delimited by
double quotes, you need to escape the quote that is part of the string,
in order to tell @command{awk} to keep processing the rest of the string.
@end table

In @command{gawk}, a number of additional two-character sequences that begin
with a backslash have special meaning in regexps.
@xref{GNU Regexp Operators}.

In a regexp, a backslash before any character that is not in the previous list
and not listed in
@ref{GNU Regexp Operators},
means that the next character should be taken literally, even if it would
normally be a regexp operator. For example, @code{/a\+b/} matches the three
characters @samp{a+b}.

@cindex backslash (@code{\}), in escape sequences
@cindex @code{\} (backslash), in escape sequences
@cindex portability
For complete portability, do not use a backslash before any character not
shown in the previous list.

To summarize:

@itemize @bullet
@item
The escape sequences in the table above are always processed first,
for both string constants and regexp constants. This happens very early,
as soon as @command{awk} reads your program.

@item
@command{gawk} processes both regexp constants and dynamic regexps
(@pxref{Computed Regexps}),
for the special operators listed in
@ref{GNU Regexp Operators}.

@item
A backslash before any other character means to treat that character
literally.
@end itemize

@c fakenode --- for prepinfo
@subheading Advanced Notes: Backslash Before Regular Characters
@cindex portability, backslash in escape sequences
@cindex POSIX @command{awk}, backslashes in string constants
@cindex backslash (@code{\}), in escape sequences, POSIX and
@cindex @code{\} (backslash), in escape sequences, POSIX and

@cindex troubleshooting, backslash before nonspecial character
If you place a backslash in a string constant before something that is
not one of the characters previously listed, POSIX @command{awk} purposely
leaves what happens as undefined. There are two choices:

@c @cindex automatic warnings
@c @cindex warnings, automatic
@table @asis
@item Strip the backslash out
This is what Unix @command{awk} and @command{gawk} both do.
For example, @code{"a\qc"} is the same as @code{"aqc"}.
(Because this is such an easy bug both to introduce and to miss,
@command{gawk} warns you about it.)
Consider @samp{FS = @w{"[ \t]+\|[ \t]+"}}, which is meant to use vertical bars
surrounded by whitespace as the field separator. There should be
two backslashes in the string: @samp{FS = @w{"[ \t]+\\|[ \t]+"}}.
@c I did this! This is why I added the warning.

@cindex @command{gawk}, escape sequences
@cindex Unix @command{awk}, backslashes in escape sequences
@item Leave the backslash alone
Some other @command{awk} implementations do this.
In such implementations, typing @code{"a\qc"} is the same as typing
@code{"a\\qc"}.
@end table

@c fakenode --- for prepinfo
@subheading Advanced Notes: Escape Sequences for Metacharacters
@cindex metacharacters, escape sequences for

Suppose you use an octal or hexadecimal
escape to represent a regexp metacharacter.
(See @ref{Regexp Operators}.)
Does @command{awk} treat the character as a literal character or as a regexp
operator?

@cindex dark corner, escape sequences, for metacharacters
Historically, such characters were taken literally.
@value{DARKCORNER}
However, the POSIX standard indicates that they should be treated
as real metacharacters, which is what @command{gawk} does.
In compatibility mode (@pxref{Options}),
@command{gawk} treats the characters represented by octal and hexadecimal
escape sequences literally when used in regexp constants. Thus,
@code{/a\52b/} is equivalent to @code{/a\*b/}.

@node Regexp Operators
@section Regular Expression Operators
@c STARTOFRANGE regexpo
@cindex regular expressions, operators

You can combine regular expressions with special characters,
called @dfn{regular expression operators} or @dfn{metacharacters}, to
increase the power and versatility of regular expressions.

The escape sequences described
@ifnotinfo
earlier
@end ifnotinfo
in @ref{Escape Sequences},
are valid inside a regexp. They are introduced by a @samp{\} and
are recognized and converted into corresponding real characters as
the very first step in processing regexps.

Here is a list of metacharacters. All characters that are not escape
sequences and that are not listed in the table stand for themselves:

@table @code
@cindex backslash (@code{\})
@cindex @code{\} (backslash)
@item \
This is used to suppress the special meaning of a character when
matching. For example, @samp{\$}
matches the character @samp{$}.

@cindex regular expressions, anchors in
@cindex Texinfo, chapter beginnings in files
@cindex @code{^} (caret)
@cindex caret (@code{^})
@item ^
This matches the beginning of a string. For example, @samp{^@@chapter}
matches @samp{@@chapter} at the beginning of a string and can be used
to identify chapter beginnings in Texinfo source files.
The @samp{^} is known as an @dfn{anchor}, because it anchors the pattern to
match only at the beginning of the string.

It is important to realize that @samp{^} does not match the beginning of
a line embedded in a string.
The condition is not true in the following example:

@example
if ("line1\nLINE 2" ~ /^L/) @dots{}
@end example

@cindex @code{$} (dollar sign)
@cindex dollar sign (@code{$})
@item $
This is similar to @samp{^}, but it matches only at the end of a string.
For example, @samp{p$}
matches a record that ends with a @samp{p}. The @samp{$} is an anchor
and does not match the end of a line embedded in a string.
The condition in the following example is not true:

@example
if ("line1\nLINE 2" ~ /1$/) @dots{}
@end example

@cindex @code{.} (period)
@cindex period (@code{.})
@item .
This matches any single character,
@emph{including} the newline character. For example, @samp{.P}
matches any single character followed by a @samp{P} in a string. Using
concatenation, we can make a regular expression such as @samp{U.A}, which
matches any three-character sequence that begins with @samp{U} and ends
with @samp{A}.

@c comma before using does NOT do tertiary
@cindex POSIX @command{awk}, period (@code{.}), using
In strict POSIX mode (@pxref{Options}),
@samp{.} does not match the @sc{nul}
character, which is a character with all bits equal to zero.
Otherwise, @sc{nul} is just another character. Other versions of @command{awk}
may not be able to match the @sc{nul} character.

@cindex @code{[]} (square brackets)
@cindex square brackets (@code{[]})
@cindex character lists
@cindex character sets, See Also character lists
@cindex bracket expressions, See character lists
@item [@dots{}]
This is called a @dfn{character list}.@footnote{In other literature,
you may see a character list referred to as a
@dfn{character set}, a @dfn{character class}, or a @dfn{bracket expression}.}
It matches any @emph{one} of the characters that are enclosed in
the square brackets. For example, @samp{[MVX]} matches any one of
the characters @samp{M}, @samp{V}, or @samp{X} in a string. A full
discussion of what can be inside the square brackets of a character list
is given in
@ref{Character Lists}.

@cindex character lists, complemented
@item [^ @dots{}]
This is a @dfn{complemented character list}. The first character after
the @samp{[} @emph{must} be a @samp{^}. It matches any single character
@emph{except} those in the square brackets. For example, @samp{[^awk]}
matches any character that is not an @samp{a}, @samp{w},
or @samp{k}.

@cindex @code{|} (vertical bar)
@cindex vertical bar (@code{|})
@item |
This is the @dfn{alternation operator} and it is used to specify
alternatives.
The @samp{|} has the lowest precedence of all the regular
expression operators.
For example, @samp{^P|[[:digit:]]}
matches any string that matches either @samp{^P} or @samp{[[:digit:]]}. This
means it matches any string that starts with @samp{P} or contains a digit.

The alternation applies to the largest possible regexps on either side.

@cindex @code{()} (parentheses)
@cindex parentheses (@code{()})
@item (@dots{})
Parentheses are used for grouping in regular expressions, as in
arithmetic. They can be used to concatenate regular expressions
containing the alternation operator, @samp{|}.
For example,
@samp{@@(samp|code)\@{[^@}]+\@}} matches both @samp{@@code@{foo@}} and
@samp{@@samp@{bar@}}.
(These are Texinfo formatting control sequences. The @samp{+} is
explained further on in this list.)

@cindex @code{*} (asterisk), @code{*} operator, as regexp operator
@cindex asterisk (@code{*}), @code{*} operator, as regexp operator
@item *
This symbol means that the preceding regular expression should be
repeated as many times as necessary to find a match. For example, @samp{ph*}
applies the @samp{*} symbol to the preceding @samp{h} and looks for matches
of one @samp{p} followed by any number of @samp{h}s. This also matches
just @samp{p} if no @samp{h}s are present.

The @samp{*} repeats the @emph{smallest} possible preceding expression.
(Use parentheses if you want to repeat a larger expression.) It finds
as many repetitions as possible. For example,
@samp{awk '/\(c[ad][ad]*r x\)/ @{ print @}' sample}
prints every record in @file{sample} containing a string of the form
@samp{(car x)}, @samp{(cdr x)}, @samp{(cadr x)}, and so on.
Notice the escaping of the parentheses by preceding them
with backslashes.

@cindex @code{+} (plus sign)
@cindex plus sign (@code{+})
@item +
This symbol is similar to @samp{*}, except that the preceding expression must be
matched at least once. This means that @samp{wh+y}
would match @samp{why} and @samp{whhy}, but not @samp{wy}, whereas
@samp{wh*y} would match all three of these strings.
The following is a simpler
way of writing the last @samp{*} example:

@example
awk '/\(c[ad]+r x\)/ @{ print @}' sample
@end example

@cindex @code{?} (question mark)
@cindex question mark (@code{?})
@item ?
This symbol is similar to @samp{*}, except that the preceding expression can be
matched either once or not at all.
For example, @samp{fe?d}
matches @samp{fed} and @samp{fd}, but nothing else.

@cindex interval expressions
@item @{@var{n}@}
@itemx @{@var{n},@}
@itemx @{@var{n},@var{m}@}
One or two numbers inside braces denote an @dfn{interval expression}.
If there is one number in the braces, the preceding regexp is repeated
@var{n} times.
If there are two numbers separated by a comma, the preceding regexp is
repeated @var{n} to @var{m} times.
If there is one number followed by a comma, then the preceding regexp
is repeated at least @var{n} times:

@table @code
@item wh@{3@}y
Matches @samp{whhhy}, but not @samp{why} or @samp{whhhhy}.

@item wh@{3,5@}y
Matches @samp{whhhy}, @samp{whhhhy}, or @samp{whhhhhy}, only.

@item wh@{2,@}y
Matches @samp{whhy} or @samp{whhhy}, and so on.
@end table

@cindex POSIX @command{awk}, interval expressions in
Interval expressions were not traditionally available in @command{awk}.
They were added as part of the POSIX standard to make @command{awk}
and @command{egrep} consistent with each other.

@cindex @command{gawk}, interval expressions and
However, because old programs may use @samp{@{} and @samp{@}} in regexp
constants, by default @command{gawk} does @emph{not} match interval expressions
in regexps. If either @option{--posix} or @option{--re-interval} is specified
(@pxref{Options}), then interval expressions
are allowed in regexps.

For new programs that use @samp{@{} and @samp{@}} in regexp constants,
it is good practice to always escape them with a backslash.
Then the
regexp constants are valid and work the way you want them to, using
any version of @command{awk}.@footnote{Use two backslashes if you're
using a string constant with a regexp operator or function.}
@end table

@cindex precedence, regexp operators
@cindex regular expressions, operators, precedence of
In regular expressions, the @samp{*}, @samp{+}, and @samp{?} operators,
as well as the braces @samp{@{} and @samp{@}},
have
the highest precedence, followed by concatenation, and finally by @samp{|}.
As in arithmetic, parentheses can change how operators are grouped.

@cindex POSIX @command{awk}, regular expressions and
@cindex @command{gawk}, regular expressions, precedence
In POSIX @command{awk} and @command{gawk}, the @samp{*}, @samp{+}, and @samp{?} operators
stand for themselves when there is nothing in the regexp that precedes them.
For example, @samp{/+/} matches a literal plus sign. However, many other versions of
@command{awk} treat such a usage as a syntax error.

If @command{gawk} is in compatibility mode
(@pxref{Options}),
POSIX character classes and interval expressions are not available in
regular expressions.
@c ENDOFRANGE regexpo

@node Character Lists
@section Using Character Lists
@c STARTOFRANGE charlist
@cindex character lists
@cindex character lists, range expressions
@cindex range expressions

Within a character list, a @dfn{range expression} consists of two
characters separated by a hyphen. It matches any single character that
sorts between the two characters, using the locale's
collating sequence and character set. For example, in the default C
locale, @samp{[a-dx-z]} is equivalent to @samp{[abcdxyz]}.
Many locales 3270sort characters in dictionary order, and in these locales, 3271@samp{[a-dx-z]} is typically not equivalent to @samp{[abcdxyz]}; instead it 3272might be equivalent to @samp{[aBbCcDdxXyYz]}, for example. To obtain 3273the traditional interpretation of bracket expressions, you can use the C 3274locale by setting the @env{LC_ALL} environment variable to the value 3275@samp{C}. 3276 3277@cindex @code{\} (backslash), in character lists 3278@cindex backslash (@code{\}), in character lists 3279@cindex @code{^} (caret), in character lists 3280@cindex caret (@code{^}), in character lists 3281@cindex @code{-} (hyphen), in character lists 3282@cindex hyphen (@code{-}), in character lists 3283To include one of the characters @samp{\}, @samp{]}, @samp{-}, or @samp{^} in a 3284character list, put a @samp{\} in front of it. For example: 3285 3286@example 3287[d\]] 3288@end example 3289 3290@noindent 3291matches either @samp{d} or @samp{]}. 3292 3293@cindex POSIX @command{awk}, character lists and 3294@cindex Extended Regular Expressions (EREs) 3295@cindex EREs (Extended Regular Expressions) 3296@cindex @command{egrep} utility 3297This treatment of @samp{\} in character lists 3298is compatible with other @command{awk} 3299implementations and is also mandated by POSIX. 3300The regular expressions in @command{awk} are a superset 3301of the POSIX specification for Extended Regular Expressions (EREs). 3302POSIX EREs are based on the regular expressions accepted by the 3303traditional @command{egrep} utility. 3304 3305@cindex character lists, character classes 3306@cindex POSIX @command{awk}, character lists and, character classes 3307@dfn{Character classes} are a new feature introduced in the POSIX standard. 3308A character class is a special notation for describing 3309lists of characters that have a specific attribute, but the 3310actual characters can vary from country to country and/or 3311from character set to character set. 
For example, the notion of what 3312is an alphabetic character differs between the United States and France. 3313 3314A character class is only valid in a regexp @emph{inside} the 3315brackets of a character list. Character classes consist of @samp{[:}, 3316a keyword denoting the class, and @samp{:]}. Here are the character 3317classes defined by the POSIX standard. 3318 3319@c the regular table is commented out while trying out the multitable. 3320@c leave it here in case we need to go back, but make sure the text 3321@c still corresponds! 3322 3323@ignore 3324@table @code 3325@item [:alnum:] 3326Alphanumeric characters. 3327 3328@item [:alpha:] 3329Alphabetic characters. 3330 3331@item [:blank:] 3332Space and TAB characters. 3333 3334@item [:cntrl:] 3335Control characters. 3336 3337@item [:digit:] 3338Numeric characters. 3339 3340@item [:graph:] 3341Characters that are printable and visible. 3342(A space is printable but not visible, whereas an @samp{a} is both.) 3343 3344@item [:lower:] 3345Lowercase alphabetic characters. 3346 3347@item [:print:] 3348Printable characters (characters that are not control characters). 3349 3350@item [:punct:] 3351Punctuation characters (characters that are not letters, digits, 3352control characters, or space characters). 3353 3354@item [:space:] 3355Space characters (such as space, TAB, and formfeed, to name a few). 3356 3357@item [:upper:] 3358Uppercase alphabetic characters. 3359 3360@item [:xdigit:] 3361Characters that are hexadecimal digits. 3362@end table 3363@end ignore 3364 3365@multitable {@code{[:xdigit:]}} {Characters that are both printable and visible. (A space is} 3366@item @code{[:alnum:]} @tab Alphanumeric characters. 3367@item @code{[:alpha:]} @tab Alphabetic characters. 3368@item @code{[:blank:]} @tab Space and TAB characters. 3369@item @code{[:cntrl:]} @tab Control characters. 3370@item @code{[:digit:]} @tab Numeric characters. 3371@item @code{[:graph:]} @tab Characters that are both printable and visible. 
3372(A space is printable but not visible, whereas an @samp{a} is both.) 3373@item @code{[:lower:]} @tab Lowercase alphabetic characters. 3374@item @code{[:print:]} @tab Printable characters (characters that are not control characters). 3375@item @code{[:punct:]} @tab Punctuation characters (characters that are not letters, digits, 3376control characters, or space characters). 3377@item @code{[:space:]} @tab Space characters (such as space, TAB, and formfeed, to name a few). 3378@item @code{[:upper:]} @tab Uppercase alphabetic characters. 3379@item @code{[:xdigit:]} @tab Characters that are hexadecimal digits. 3380@end multitable 3381 3382For example, before the POSIX standard, you had to write @code{/[A-Za-z0-9]/} 3383to match alphanumeric characters. If your 3384character set had other alphabetic characters in it, this would not 3385match them, and if your character set collated differently from 3386ASCII, this might not even match the ASCII alphanumeric characters. 3387With the POSIX character classes, you can write 3388@code{/[[:alnum:]]/} to match the alphabetic 3389and numeric characters in your character set. 3390 3391@cindex character lists, collating elements 3392@cindex character lists, non-ASCII 3393@cindex collating elements 3394Two additional special sequences can appear in character lists. 3395These apply to non-ASCII character sets, which can have single symbols 3396(called @dfn{collating elements}) that are represented with more than one 3397character. They can also have several characters that are equivalent for 3398@dfn{collating}, or sorting, purposes. (For example, in French, a plain ``e'' 3399and a grave-accented ``@`e'' are equivalent.) 3400These sequences are: 3401 3402@table @asis 3403@cindex character lists, collating symbols 3404@cindex collating symbols 3405@item Collating symbols 3406Multicharacter collating elements enclosed between 3407@samp{[.} and @samp{.]}. 
For example, if @samp{ch} is a collating element, 3408then @code{[[.ch.]]} is a regexp that matches this collating element, whereas 3409@code{[ch]} is a regexp that matches either @samp{c} or @samp{h}. 3410 3411@cindex character lists, equivalence classes 3412@item Equivalence classes 3413Locale-specific names for a list of 3414characters that are equal. The name is enclosed between 3415@samp{[=} and @samp{=]}. 3416For example, the name @samp{e} might be used to represent all of 3417``e,'' ``@`e,'' and ``@'e.'' In this case, @code{[[=e=]]} is a regexp 3418that matches any of @samp{e}, @samp{@'e}, or @samp{@`e}. 3419@end table 3420 3421These features are very valuable in non-English-speaking locales. 3422 3423@cindex internationalization, localization, character classes 3424@cindex @command{gawk}, character classes and 3425@cindex POSIX @command{awk}, character lists and, character classes 3426@strong{Caution:} The library functions that @command{gawk} uses for regular 3427expression matching currently recognize only POSIX character classes; 3428they do not recognize collating symbols or equivalence classes. 3429@c maybe one day ... 3430@c ENDOFRANGE charlist 3431 3432@node GNU Regexp Operators 3433@section @command{gawk}-Specific Regexp Operators 3434 3435@c This section adapted (long ago) from the regex-0.12 manual 3436 3437@c STARTOFRANGE regexpg 3438@cindex regular expressions, operators, @command{gawk} 3439@c STARTOFRANGE gregexp 3440@cindex @command{gawk}, regular expressions, operators 3441@cindex operators, GNU-specific 3442@cindex regular expressions, operators, for words 3443@cindex word, regexp definition of 3444GNU software that deals with regular expressions provides a number of 3445additional regexp operators. These operators are described in this 3446@value{SECTION} and are specific to @command{gawk}; 3447they are not available in other @command{awk} implementations. 3448Most of the additional operators deal with word matching. 
3449For our purposes, a @dfn{word} is a sequence of one or more letters, digits, 3450or underscores (@samp{_}): 3451 3452@table @code 3453@c @cindex operators, @code{\w} (@command{gawk}) 3454@cindex backslash (@code{\}), @code{\w} operator (@command{gawk}) 3455@cindex @code{\} (backslash), @code{\w} operator (@command{gawk}) 3456@item \w 3457Matches any word-constituent character---that is, it matches any 3458letter, digit, or underscore. Think of it as shorthand for 3459@w{@code{[[:alnum:]_]}}. 3460 3461@c @cindex operators, @code{\W} (@command{gawk}) 3462@cindex backslash (@code{\}), @code{\W} operator (@command{gawk}) 3463@cindex @code{\} (backslash), @code{\W} operator (@command{gawk}) 3464@item \W 3465Matches any character that is not word-constituent. 3466Think of it as shorthand for 3467@w{@code{[^[:alnum:]_]}}. 3468 3469@c @cindex operators, @code{\<} (@command{gawk}) 3470@cindex backslash (@code{\}), @code{\<} operator (@command{gawk}) 3471@cindex @code{\} (backslash), @code{\<} operator (@command{gawk}) 3472@item \< 3473Matches the empty string at the beginning of a word. 3474For example, @code{/\<away/} matches @samp{away} but not 3475@samp{stowaway}. 3476 3477@c @cindex operators, @code{\>} (@command{gawk}) 3478@cindex backslash (@code{\}), @code{\>} operator (@command{gawk}) 3479@cindex @code{\} (backslash), @code{\>} operator (@command{gawk}) 3480@item \> 3481Matches the empty string at the end of a word. 3482For example, @code{/stow\>/} matches @samp{stow} but not @samp{stowaway}. 3483 3484@c @cindex operators, @code{\y} (@command{gawk}) 3485@cindex backslash (@code{\}), @code{\y} operator (@command{gawk}) 3486@cindex @code{\} (backslash), @code{\y} operator (@command{gawk}) 3487@c comma before using does NOT do secondary 3488@cindex word boundaries, matching 3489@item \y 3490Matches the empty string at either the beginning or the 3491end of a word (i.e., the word boundar@strong{y}). 
For example, @samp{\yballs?\y} 3492matches either @samp{ball} or @samp{balls}, as a separate word. 3493 3494@c @cindex operators, @code{\B} (@command{gawk}) 3495@cindex backslash (@code{\}), @code{\B} operator (@command{gawk}) 3496@cindex @code{\} (backslash), @code{\B} operator (@command{gawk}) 3497@item \B 3498Matches the empty string that occurs between two 3499word-constituent characters. For example, 3500@code{/\Brat\B/} matches @samp{crate} but it does not match @samp{dirty rat}. 3501@samp{\B} is essentially the opposite of @samp{\y}. 3502@end table 3503 3504@cindex buffers, operators for 3505@cindex regular expressions, operators, for buffers 3506@cindex operators, string-matching, for buffers 3507There are two other operators that work on buffers. In Emacs, a 3508@dfn{buffer} is, naturally, an Emacs buffer. For other programs, 3509@command{gawk}'s regexp library routines consider the entire 3510string to match as the buffer. 3511The operators are: 3512 3513@table @code 3514@item \` 3515@c @cindex operators, @code{\`} (@command{gawk}) 3516@cindex backslash (@code{\}), @code{\`} operator (@command{gawk}) 3517@cindex @code{\} (backslash), @code{\`} operator (@command{gawk}) 3518Matches the empty string at the 3519beginning of a buffer (string). 3520 3521@c @cindex operators, @code{\'} (@command{gawk}) 3522@cindex backslash (@code{\}), @code{\'} operator (@command{gawk}) 3523@cindex @code{\} (backslash), @code{\'} operator (@command{gawk}) 3524@item \' 3525Matches the empty string at the 3526end of a buffer (string). 3527@end table 3528 3529@cindex @code{^} (caret) 3530@cindex caret (@code{^}) 3531@cindex @code{?} (question mark) 3532@cindex question mark (@code{?}) 3533Because @samp{^} and @samp{$} always work in terms of the beginning 3534and end of strings, these operators don't add any new capabilities 3535for @command{awk}. They are provided for compatibility with other 3536GNU software. 
3537 3538@cindex @command{gawk}, word-boundary operator 3539@cindex word-boundary operator (@command{gawk}) 3540@cindex operators, word-boundary (@command{gawk}) 3541In other GNU software, the word-boundary operator is @samp{\b}. However, 3542that conflicts with the @command{awk} language's definition of @samp{\b} 3543as backspace, so @command{gawk} uses a different letter. 3544An alternative method would have been to require two backslashes in the 3545GNU operators, but this was deemed too confusing. The current 3546method of using @samp{\y} for the GNU @samp{\b} appears to be the 3547lesser of two evils. 3548 3549@c NOTE!!! Keep this in sync with the same table in the summary appendix! 3550@c 3551@c Should really do this with file inclusion. 3552@cindex regular expressions, @command{gawk}, command-line options 3553@cindex @command{gawk}, command-line options 3554The various command-line options 3555(@pxref{Options}) 3556control how @command{gawk} interprets characters in regexps: 3557 3558@table @asis 3559@item No options 3560In the default case, @command{gawk} provides all the facilities of 3561POSIX regexps and the 3562@ifnotinfo 3563previously described 3564GNU regexp operators. 3565@end ifnotinfo 3566@ifnottex 3567GNU regexp operators described 3568in @ref{Regexp Operators}. 3569@end ifnottex 3570However, interval expressions are not supported. 3571 3572@item @code{--posix} 3573Only POSIX regexps are supported; the GNU operators are not special 3574(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions 3575are allowed. 3576 3577@item @code{--traditional} 3578Traditional Unix @command{awk} regexps are matched. The GNU operators 3579are not special, interval expressions are not available, nor 3580are the POSIX character classes (@code{[[:alnum:]]}, etc.). 3581Characters described by octal and hexadecimal escape sequences are 3582treated literally, even if they represent regexp metacharacters. 

@item @code{--re-interval}
Allow interval expressions in regexps, even if @option{--traditional}
has been provided. (@option{--posix} automatically enables
interval expressions, so @option{--re-interval} is redundant
when @option{--posix} is used.)
@end table
@c ENDOFRANGE gregexp
@c ENDOFRANGE regexpg

@node Case-sensitivity
@section Case Sensitivity in Matching

@c STARTOFRANGE regexpcs
@cindex regular expressions, case sensitivity
@c STARTOFRANGE csregexp
@cindex case sensitivity, regexps and
Case is normally significant in regular expressions, both when matching
ordinary characters (i.e., not metacharacters) and inside character
lists. Thus, a @samp{w} in a regular expression matches only a lowercase
@samp{w} and not an uppercase @samp{W}.

The simplest way to do a case-independent match is to use a character
list---for example, @samp{[Ww]}. However, this can be cumbersome if
you need to use it often, and it can make the regular expressions harder
to read. There are two alternatives that you might prefer.

One way to perform a case-insensitive match at a particular point in the
program is to convert the data to a single case, using the
@code{tolower} or @code{toupper} built-in string functions (which we
haven't discussed yet;
@pxref{String Functions}).
For example:

@example
tolower($1) ~ /foo/  @{ @dots{} @}
@end example

@noindent
converts the first field to lowercase before matching against it.
This works in any POSIX-compliant @command{awk}.
3624 3625@cindex @command{gawk}, regular expressions, case sensitivity 3626@cindex case sensitivity, @command{gawk} 3627@cindex differences in @command{awk} and @command{gawk}, regular expressions 3628@cindex @code{~} (tilde), @code{~} operator 3629@cindex tilde (@code{~}), @code{~} operator 3630@cindex @code{!} (exclamation point), @code{!~} operator 3631@cindex exclamation point (@code{!}), @code{!~} operator 3632@cindex @code{IGNORECASE} variable 3633@c @cindex variables, @code{IGNORECASE} 3634Another method, specific to @command{gawk}, is to set the variable 3635@code{IGNORECASE} to a nonzero value (@pxref{Built-in Variables}). 3636When @code{IGNORECASE} is not zero, @emph{all} regexp and string 3637operations ignore case. Changing the value of 3638@code{IGNORECASE} dynamically controls the case-sensitivity of the 3639program as it runs. Case is significant by default because 3640@code{IGNORECASE} (like most variables) is initialized to zero: 3641 3642@example 3643x = "aB" 3644if (x ~ /ab/) @dots{} # this test will fail 3645 3646IGNORECASE = 1 3647if (x ~ /ab/) @dots{} # now it will succeed 3648@end example 3649 3650In general, you cannot use @code{IGNORECASE} to make certain rules 3651case-insensitive and other rules case-sensitive, because there is no 3652straightforward way 3653to set @code{IGNORECASE} just for the pattern of 3654a particular rule.@footnote{Experienced C and C++ programmers will note 3655that it is possible, using something like 3656@samp{IGNORECASE = 1 && /foObAr/ @{ @dots{} @}} 3657and 3658@samp{IGNORECASE = 0 || /foobar/ @{ @dots{} @}}. 3659However, this is somewhat obscure and we don't recommend it.} 3660To do this, use either character lists or @code{tolower}. However, one 3661thing you can do with @code{IGNORECASE} only is dynamically turn 3662case-sensitivity on or off for all the rules at once. 
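
As a minimal sketch of that portable alternative, the following pipeline
mixes a case-insensitive rule (written with @code{tolower}) and a
case-sensitive rule in one program. The input lines and the words
@samp{error} and @samp{Debug} are made up for illustration, and the shell
variable @code{mixed} is only a convenience for capturing the output.
Both rules are patterns with no action, so matching records are simply
printed:

```shell
# Rule 1 ignores case by lowercasing the record before matching;
# rule 2 remains case-sensitive. The input lines are hypothetical.
mixed=$(printf 'ERROR disk full\nerror retry\ndebug noise\nDebug trace\n' |
    awk 'tolower($0) ~ /error/
/Debug/')
printf '%s\n' "$mixed"
```

The first two lines are selected by the case-insensitive rule and
@samp{Debug trace} by the case-sensitive one; @samp{debug noise}
matches neither rule.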

@code{IGNORECASE} can be set on the command line or in a @code{BEGIN} rule
(@pxref{Other Arguments}; also
@pxref{Using BEGIN/END}).
Setting @code{IGNORECASE} from the command line is a way to make
a program case-insensitive without having to edit it.

Prior to @command{gawk} 3.0, the value of @code{IGNORECASE}
affected regexp operations only. It did not affect string comparison
with @samp{==}, @samp{!=}, and so on.
Beginning with @value{PVERSION} 3.0, both regexp and string comparison
operations are affected by @code{IGNORECASE}.

@c @cindex ISO 8859-1
@c @cindex ISO Latin-1
Beginning with @command{gawk} 3.0,
the equivalences between upper-
and lowercase characters are based on the ISO-8859-1 (ISO Latin-1)
character set. This character set is a superset of the traditional 128
ASCII characters that also provides a number of characters suitable
for use with European languages.

The value of @code{IGNORECASE} has no effect if @command{gawk} is in
compatibility mode (@pxref{Options}).
Case is always significant in compatibility mode.
@c ENDOFRANGE csregexp
@c ENDOFRANGE regexpcs

@node Leftmost Longest
@section How Much Text Matches?

@cindex regular expressions, leftmost longest match
@c @cindex matching, leftmost longest
Consider the following:

@example
echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
@end example

This example uses the @code{sub} function (which we haven't discussed yet;
@pxref{String Functions})
to make a change to the input record. Here, the regexp @code{/a+/}
indicates ``one or more @samp{a} characters,'' and the replacement
text is @samp{<A>}.

The input contains four @samp{a} characters.
@command{awk} (and POSIX) regular expressions always match
the leftmost, @emph{longest} sequence of input characters that can
match.
Thus, all four @samp{a} characters are 3712replaced with @samp{<A>} in this example: 3713 3714@example 3715$ echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}' 3716@print{} <A>bcd 3717@end example 3718 3719For simple match/no-match tests, this is not so important. But when doing 3720text matching and substitutions with the @code{match}, @code{sub}, @code{gsub}, 3721and @code{gensub} functions, it is very important. 3722@ifinfo 3723@xref{String Functions}, 3724for more information on these functions. 3725@end ifinfo 3726Understanding this principle is also important for regexp-based record 3727and field splitting (@pxref{Records}, 3728and also @pxref{Field Separators}). 3729 3730@node Computed Regexps 3731@section Using Dynamic Regexps 3732 3733@c STARTOFRANGE dregexp 3734@cindex regular expressions, computed 3735@c STARTOFRANGE regexpd 3736@cindex regular expressions, dynamic 3737@cindex @code{~} (tilde), @code{~} operator 3738@cindex tilde (@code{~}), @code{~} operator 3739@cindex @code{!} (exclamation point), @code{!~} operator 3740@cindex exclamation point (@code{!}), @code{!~} operator 3741@c @cindex operators, @code{~} 3742@c @cindex operators, @code{!~} 3743The righthand side of a @samp{~} or @samp{!~} operator need not be a 3744regexp constant (i.e., a string of characters between slashes). It may 3745be any expression. The expression is evaluated and converted to a string 3746if necessary; the contents of the string are used as the 3747regexp. A regexp that is computed in this way is called a @dfn{dynamic 3748regexp}: 3749 3750@example 3751BEGIN @{ digits_regexp = "[[:digit:]]+" @} 3752$0 ~ digits_regexp @{ print @} 3753@end example 3754 3755@noindent 3756This sets @code{digits_regexp} to a regexp that describes one or more digits, 3757and tests whether the input record matches this regexp. 
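
Because the righthand side can be any expression, the regexp text can
also come from outside the program. The following is a minimal sketch,
not one of the original examples: it assumes the standard @option{-v}
option (@pxref{Options}) to assign a hypothetical variable named
@code{re}, and it uses made-up input. The rule has no action, so
matching records are simply printed:

```shell
# The regexp text is supplied at run time in the (hypothetical) variable re.
# A bracket range is used here instead of a POSIX character class only to
# keep the sketch independent of the awk version in use.
nums=$(printf '42\nfoo\n7up\n2003\n' | awk -v re='^[0-9]+$' '$0 ~ re')
printf '%s\n' "$nums"
```

Only the records that consist entirely of digits (@samp{42} and
@samp{2003}) are selected.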

@c @strong{Caution:} When using the @samp{~} and @samp{!~}
@strong{Caution:} When using the @samp{~} and @samp{!~}
operators, there is a difference between a regexp constant
enclosed in slashes and a string constant enclosed in double quotes.
If you are going to use a string constant, you have to understand that
the string is, in essence, scanned @emph{twice}: the first time when
@command{awk} reads your program, and the second time when it goes to
match the string on the lefthand side of the operator with the pattern
on the right. This is true of any string-valued expression (such as
@code{digits_regexp}, shown previously), not just string constants.

@cindex regexp constants, slashes vs. quotes
@cindex @code{\} (backslash), regexp constants
@cindex backslash (@code{\}), regexp constants
@cindex @code{"} (double quote), regexp constants
@cindex double quote (@code{"}), regexp constants
What difference does it make if the string is
scanned twice? The answer has to do with escape sequences, and particularly
with backslashes. To get a backslash into a regular expression inside a
string, you have to type two backslashes.

For example, @code{/\*/} is a regexp constant for a literal @samp{*}.
Only one backslash is needed. To do the same thing with a string,
you have to type @code{"\\*"}. The first backslash escapes the
second one so that the string actually contains the
two characters @samp{\} and @samp{*}.

@cindex troubleshooting, regexp constants vs. string constants
@cindex regexp constants, vs. string constants
@cindex string constants, vs. regexp constants
Given that you can use both regexp and string constants to describe
regular expressions, which should you use?
The answer is ``regexp 3792constants,'' for several reasons: 3793 3794@itemize @bullet 3795@item 3796String constants are more complicated to write and 3797more difficult to read. Using regexp constants makes your programs 3798less error-prone. Not understanding the difference between the two 3799kinds of constants is a common source of errors. 3800 3801@item 3802It is more efficient to use regexp constants. @command{awk} can note 3803that you have supplied a regexp and store it internally in a form that 3804makes pattern matching more efficient. When using a string constant, 3805@command{awk} must first convert the string into this internal form and 3806then perform the pattern matching. 3807 3808@item 3809Using regexp constants is better form; it shows clearly that you 3810intend a regexp match. 3811@end itemize 3812 3813@c fakenode --- for prepinfo 3814@subheading Advanced Notes: Using @code{\n} in Character Lists of Dynamic Regexps 3815@cindex regular expressions, dynamic, with embedded newlines 3816@cindex newlines, in dynamic regexps 3817 3818Some commercial versions of @command{awk} do not allow the newline 3819character to be used inside a character list for a dynamic regexp: 3820 3821@example 3822$ awk '$0 ~ "[ \t\n]"' 3823@error{} awk: newline in character class [ 3824@error{} ]... 3825@error{} source line number 1 3826@error{} context is 3827@error{} >>> <<< 3828@end example 3829 3830@cindex newlines, in regexp constants 3831But a newline in a regexp constant works with no problem: 3832 3833@example 3834$ awk '$0 ~ /[ \t\n]/' 3835here is a sample line 3836@print{} here is a sample line 3837@kbd{@value{CTL}-d} 3838@end example 3839 3840@command{gawk} does not have this problem, and it isn't likely to 3841occur often in practice, but it's worth noting for future reference. 
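
To tie the earlier discussion of backslashes in this @value{SECTION} to
something runnable, the following sketch matches a literal @samp{*}
twice: once with a regexp constant and once with a string constant.
The input lines and the shell variable names are invented for
illustration:

```shell
# Regexp constant: a single backslash is enough.
with_slashes=$(printf 'a*b\nab\n' | awk '$0 ~ /\*/')
# String constant: the backslash is doubled, because the text is
# scanned once as a string and then again as a regexp.
with_quotes=$(printf 'a*b\nab\n' | awk '$0 ~ "\\*"')
printf '%s\n%s\n' "$with_slashes" "$with_quotes"
```

Both commands select only the record @samp{a*b}.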
3842@c ENDOFRANGE dregexp 3843@c ENDOFRANGE regexpd 3844@c ENDOFRANGE regexp 3845 3846@node Locales 3847@section Where You Are Makes A Difference 3848 3849Modern systems support the notion of @dfn{locales}: a way to tell 3850the system about the local character set and language. The current 3851locale setting can affect the way regexp matching works, often 3852in surprising ways. In particular, many locales do case-insensitive 3853matching, even when you may have specified characters of only 3854one particular case. 3855 3856The following example uses the @code{sub} function, which 3857does text replacement 3858(@pxref{String Functions}). 3859Here, the intent is to remove trailing uppercase characters: 3860 3861@example 3862$ echo something1234abc | gawk '@{ sub("[A-Z]*$", ""); print @}' 3863@print{} something1234 3864@end example 3865 3866@noindent 3867This output is unexpected, since the @samp{abc} at the end of @samp{something1234abc} 3868should not normally match @samp{[A-Z]*}. This result is due to the 3869locale setting (and thus you may not see it on your system). 3870There are two fixes. The first is to use the POSIX character 3871class @samp{[[:upper:]]}, instead of @samp{[A-Z]}. 3872The second is to change the locale setting in the environment, 3873before running @command{gawk}, 3874by using the shell statements: 3875 3876@example 3877LANG=C LC_ALL=C 3878export LANG LC_ALL 3879@end example 3880 3881The setting @samp{C} forces @command{gawk} to behave in the traditional 3882Unix manner, where case distinctions do matter. 3883You may wish to put these statements into your shell startup file, 3884e.g., @file{$HOME/.profile}. 3885 3886Similar considerations apply to other ranges. For example, 3887@samp{["-/]} is perfectly valid in ASCII, but is not valid in many 3888Unicode locales, such as @samp{en_US.UTF-8}. 
(In general, such
ranges should be avoided; either list the characters individually,
or use a POSIX character class such as @samp{[[:punct:]]}.)

For the normal case of @samp{RS = "\n"}, the locale is largely irrelevant.
For other single-byte record separators, using @samp{LC_ALL=C} will give you
much better performance when reading records. Otherwise, @command{gawk} has
to make several function calls @emph{per input character} to find the record
terminator.

@node Reading Files
@chapter Reading Input Files

@c STARTOFRANGE infir
@cindex input files, reading
@cindex input files
@cindex @code{FILENAME} variable
In the typical @command{awk} program, all input is read either from the
standard input (by default, this is the keyboard, but often it is a pipe from another
command) or from files whose names you specify on the @command{awk}
command line. If you specify input files, @command{awk} reads them
in order, processing all the data from one before going on to the next.
The name of the current input file can be found in the built-in variable
@code{FILENAME}
(@pxref{Built-in Variables}).

@cindex records
@cindex fields
The input is read in units called @dfn{records}, and is processed by the
rules of your program one record at a time.
By default, each record is one line. Each
record is automatically split into chunks called @dfn{fields}.
This makes it more convenient for programs to work on the parts of a record.

@cindex @code{getline} command
On rare occasions, you may need to use the @code{getline} command.
The @code{getline} command is valuable, both because it
can do explicit input from any number of files, and because the files
used with it do not have to be named on the @command{awk} command line
(@pxref{Getline}).

@menu
* Records:: Controlling how data is split into records.
3931* Fields:: An introduction to fields. 3932* Nonconstant Fields:: Nonconstant Field Numbers. 3933* Changing Fields:: Changing the Contents of a Field. 3934* Field Separators:: The field separator and how to change it. 3935* Constant Size:: Reading constant width data. 3936* Multiple Line:: Reading multi-line records. 3937* Getline:: Reading files under explicit program control 3938 using the @code{getline} function. 3939@end menu 3940 3941@node Records 3942@section How Input Is Split into Records 3943 3944@c STARTOFRANGE inspl 3945@cindex input, splitting into records 3946@c STARTOFRANGE recspl 3947@cindex records, splitting input into 3948@cindex @code{NR} variable 3949@cindex @code{FNR} variable 3950The @command{awk} utility divides the input for your @command{awk} 3951program into records and fields. 3952@command{awk} keeps track of the number of records that have 3953been read 3954so far 3955from the current input file. This value is stored in a 3956built-in variable called @code{FNR}. It is reset to zero when a new 3957file is started. Another built-in variable, @code{NR}, is the total 3958number of input records read so far from all @value{DF}s. It starts at zero, 3959but is never automatically reset to zero. 3960 3961@cindex separators, for records 3962@cindex record separators 3963Records are separated by a character called the @dfn{record separator}. 3964By default, the record separator is the newline character. 3965This is why records are, by default, single lines. 3966A different character can be used for the record separator by 3967assigning the character to the built-in variable @code{RS}. 3968 3969@cindex newlines, as record separators 3970@cindex @code{RS} variable 3971Like any other variable, 3972the value of @code{RS} can be changed in the @command{awk} program 3973with the assignment operator, @samp{=} 3974(@pxref{Assignment Ops}). 
3975The new record-separator character should be enclosed in quotation marks, 3976which indicate a string constant. Often the right time to do this is 3977at the beginning of execution, before any input is processed, 3978so that the very first record is read with the proper separator. 3979To do this, use the special @code{BEGIN} pattern 3980(@pxref{BEGIN/END}). 3981For example: 3982 3983@cindex @code{BEGIN} pattern 3984@example 3985awk 'BEGIN @{ RS = "/" @} 3986 @{ print $0 @}' BBS-list 3987@end example 3988 3989@noindent 3990changes the value of @code{RS} to @code{"/"}, before reading any input. 3991This is a string whose first character is a slash; as a result, records 3992are separated by slashes. Then the input file is read, and the second 3993rule in the @command{awk} program (the action with no pattern) prints each 3994record. Because each @code{print} statement adds a newline at the end of 3995its output, this @command{awk} program copies the input 3996with each slash changed to a newline. Here are the results of running 3997the program on @file{BBS-list}: 3998 3999@example 4000$ awk 'BEGIN @{ RS = "/" @} 4001> @{ print $0 @}' BBS-list 4002@print{} aardvark 555-5553 1200 4003@print{} 300 B 4004@print{} alpo-net 555-3412 2400 4005@print{} 1200 4006@print{} 300 A 4007@print{} barfly 555-7685 1200 4008@print{} 300 A 4009@print{} bites 555-1675 2400 4010@print{} 1200 4011@print{} 300 A 4012@print{} camelot 555-0542 300 C 4013@print{} core 555-2912 1200 4014@print{} 300 C 4015@print{} fooey 555-1234 2400 4016@print{} 1200 4017@print{} 300 B 4018@print{} foot 555-6699 1200 4019@print{} 300 B 4020@print{} macfoo 555-6480 1200 4021@print{} 300 A 4022@print{} sdace 555-3430 2400 4023@print{} 1200 4024@print{} 300 A 4025@print{} sabafoo 555-2127 1200 4026@print{} 300 C 4027@print{} 4028@end example 4029 4030@noindent 4031Note that the entry for the @samp{camelot} BBS is not split. 
4032In the original @value{DF} 4033(@pxref{Sample Data Files}), 4034the line looks like this: 4035 4036@example 4037camelot 555-0542 300 C 4038@end example 4039 4040@noindent 4041It has one baud rate only, so there are no slashes in the record, 4042unlike the others which have two or more baud rates. 4043In fact, this record is treated as part of the record 4044for the @samp{core} BBS; the newline separating them in the output 4045is the original newline in the @value{DF}, not the one added by 4046@command{awk} when it printed the record! 4047 4048@cindex record separators, changing 4049@cindex separators, for records 4050Another way to change the record separator is on the command line, 4051using the variable-assignment feature 4052(@pxref{Other Arguments}): 4053 4054@example 4055awk '@{ print $0 @}' RS="/" BBS-list 4056@end example 4057 4058@noindent 4059This sets @code{RS} to @samp{/} before processing @file{BBS-list}. 4060 4061Using an unusual character such as @samp{/} for the record separator 4062produces correct behavior in the vast majority of cases. However, 4063the following (extreme) pipeline prints a surprising @samp{1}: 4064 4065@example 4066$ echo | awk 'BEGIN @{ RS = "a" @} ; @{ print NF @}' 4067@print{} 1 4068@end example 4069 4070There is one field, consisting of a newline. The value of the built-in 4071variable @code{NF} is the number of fields in the current record. 4072 4073@cindex dark corner, input files 4074Reaching the end of an input file terminates the current input record, 4075even if the last character in the file is not the character in @code{RS}. 4076@value{DARKCORNER} 4077 4078@cindex null strings 4079@cindex strings, empty, See null strings 4080The empty string @code{""} (a string without any characters) 4081has a special meaning 4082as the value of @code{RS}. It means that records are separated 4083by one or more blank lines and nothing else. 4084@xref{Multiple Line}, for more details. 
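
As a minimal sketch of the last two points (our own illustration, not
taken from the sample @value{DF}s; it assumes a @command{printf} utility
that interprets @samp{\n}), the first pipeline below feeds @command{awk}
input that does not end with the record separator, and the second uses
the empty string as the value of @code{RS}:

@example
$ printf 'a/b' | awk 'BEGIN @{ RS = "/" @} ; @{ print NR ": " $0 @}'
@print{} 1: a
@print{} 2: b
$ printf 'a\nb\n\nc\n' | awk 'BEGIN @{ RS = "" @} ; @{ print NR, NF @}'
@print{} 1 2
@print{} 2 1
@end example

@noindent
In the first case, the end of the input terminates the record @samp{b}.
In the second case, the blank line separates two records. The first of
them has two fields because, when @code{RS} is the empty string, a
newline also separates fields.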
4085 4086If you change the value of @code{RS} in the middle of an @command{awk} run, 4087the new value is used to delimit subsequent records, but the record 4088currently being processed, as well as records already processed, are not 4089affected. 4090 4091@cindex @code{RT} variable 4092@cindex records, terminating 4093@cindex terminating records 4094@cindex differences in @command{awk} and @command{gawk}, record separators 4095@cindex regular expressions, as record separators 4096@cindex record separators, regular expressions as 4097@cindex separators, for records, regular expressions as 4098After the end of the record has been determined, @command{gawk} 4099sets the variable @code{RT} to the text in the input that matched 4100@code{RS}. 4101When using @command{gawk}, 4102the value of @code{RS} is not limited to a one-character 4103string. It can be any regular expression 4104(@pxref{Regexp}). 4105In general, each record 4106ends at the next string that matches the regular expression; the next 4107record starts at the end of the matching string. This general rule is 4108actually at work in the usual case, where @code{RS} contains just a 4109newline: a record ends at the beginning of the next matching string (the 4110next newline in the input), and the following record starts just after 4111the end of this string (at the first character of the following line). 4112The newline, because it matches @code{RS}, is not part of either record. 4113 4114When @code{RS} is a single character, @code{RT} 4115contains the same single character. However, when @code{RS} is a 4116regular expression, @code{RT} contains 4117the actual input text that matched the regular expression. 4118 4119The following example illustrates both of these features. 
It sets @code{RS} equal to a regular expression that
matches either a newline or a series of one or more uppercase letters
with optional leading and/or trailing whitespace:

@example
$ echo record 1 AAAA record 2 BBBB record 3 |
> gawk 'BEGIN @{ RS = "\n|( *[[:upper:]]+ *)" @}
> @{ print "Record =", $0, "and RT =", RT @}'
@print{} Record = record 1 and RT = AAAA
@print{} Record = record 2 and RT = BBBB
@print{} Record = record 3 and RT =
@print{}
@end example

@noindent
The final line of output has an extra blank line. This is because the
value of @code{RT} is a newline, and the @code{print} statement
supplies its own terminating newline.
@xref{Simple Sed}, for a more useful example
of @code{RS} as a regexp and @code{RT}.

If you set @code{RS} to a regular expression that allows optional
trailing text, such as @samp{RS = "abc(XYZ)?"}, it is possible, due
to implementation constraints, that @command{gawk} may match the leading
part of the regular expression, but not the trailing part, particularly
if the input text that could match the trailing part is fairly long.
@command{gawk} attempts to avoid this problem, but currently, there's
no guarantee that this will never happen.

@cindex differences in @command{awk} and @command{gawk}, @code{RS}/@code{RT} variables
The use of @code{RS} as a regular expression and the @code{RT}
variable are @command{gawk} extensions; they are not available in
compatibility mode
(@pxref{Options}).
In compatibility mode, only the first character of the value of
@code{RS} is used to determine the end of the record.
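
As a further small sketch (our own example; it requires @command{gawk},
because it relies on @code{RT} and on a regular expression as the value
of @code{RS}), the following pipeline prints each record together with
the text that terminated it:

@example
$ printf 'a,b;c' | gawk 'BEGIN @{ RS = "[,;]" @}
> @{ print $0 "|" RT @}'
@print{} a|,
@print{} b|;
@print{} c|
@end example

@noindent
The last record is ended by the end of the input rather than by a
match for @code{RS}, so @code{RT} is empty for that record.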
4156 4157@c fakenode --- for prepinfo 4158@subheading Advanced Notes: @code{RS = "\0"} Is Not Portable 4159 4160@cindex advanced features, @value{DF}s as single record 4161@cindex portability, @value{DF}s as single record 4162There are times when you might want to treat an entire @value{DF} as a 4163single record. The only way to make this happen is to give @code{RS} 4164a value that you know doesn't occur in the input file. This is hard 4165to do in a general way, such that a program always works for arbitrary 4166input files. 4167@c can you say `understatement' boys and girls? 4168 4169You might think that for text files, the @sc{nul} character, which 4170consists of a character with all bits equal to zero, is a good 4171value to use for @code{RS} in this case: 4172 4173@example 4174BEGIN @{ RS = "\0" @} # whole file becomes one record? 4175@end example 4176 4177@cindex differences in @command{awk} and @command{gawk}, strings, storing 4178@command{gawk} in fact accepts this, and uses the @sc{nul} 4179character for the record separator. 4180However, this usage is @emph{not} portable 4181to other @command{awk} implementations. 4182 4183@cindex dark corner, strings, storing 4184All other @command{awk} implementations@footnote{At least that we know 4185about.} store strings internally as C-style strings. C strings use the 4186@sc{nul} character as the string terminator. In effect, this means that 4187@samp{RS = "\0"} is the same as @samp{RS = ""}. 4188@value{DARKCORNER} 4189 4190@cindex records, treating files as 4191@cindex files, as single records 4192The best way to treat a whole file as a single record is to 4193simply read the file in, one record at a time, concatenating each 4194record onto the end of the previous ones. 
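
A minimal sketch of this approach follows. It is our own example: the
variable name @code{whole} is arbitrary, and the code assumes the
default value of @code{RS}, so that each record is one line:

@example
$ printf 'one\ntwo\n' | awk '@{ whole = whole $0 "\n" @}
> END @{ printf "%d characters read\n", length(whole) @}'
@print{} 8 characters read
@end example

@noindent
The first rule appends each record, plus the newline that ended it, to
@code{whole}. By the time the @code{END} rule runs, @code{whole} holds
the text of all the input. Note that this sketch appends a newline even
if the last line of the input did not end with one, and that with
several input files the variable accumulates the contents of all of them.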
4195@c ENDOFRANGE inspl 4196@c ENDOFRANGE recspl 4197 4198@node Fields 4199@section Examining Fields 4200 4201@cindex examining fields 4202@cindex fields 4203@cindex accessing fields 4204@c STARTOFRANGE fiex 4205@cindex fields, examining 4206@cindex POSIX @command{awk}, field separators and 4207@cindex field separators, POSIX and 4208@cindex separators, field, POSIX and 4209When @command{awk} reads an input record, the record is 4210automatically @dfn{parsed} or separated by the interpreter into chunks 4211called @dfn{fields}. By default, fields are separated by @dfn{whitespace}, 4212like words in a line. 4213Whitespace in @command{awk} means any string of one or more spaces, 4214tabs, or newlines;@footnote{In POSIX @command{awk}, newlines are not 4215considered whitespace for separating fields.} other characters, such as 4216formfeed, vertical tab, etc.@: that are 4217considered whitespace by other languages, are @emph{not} considered 4218whitespace by @command{awk}. 4219 4220The purpose of fields is to make it more convenient for you to refer to 4221these pieces of the record. You don't have to use them---you can 4222operate on the whole record if you want---but fields are what make 4223simple @command{awk} programs so powerful. 4224 4225@cindex @code{$} field operator 4226@cindex field operator @code{$} 4227@cindex @code{$} (dollar sign), @code{$} field operator 4228@cindex dollar sign (@code{$}), @code{$} field operator 4229@c The comma here does NOT mark a secondary term: 4230@cindex field operators, dollar sign as 4231A dollar-sign (@samp{$}) is used 4232to refer to a field in an @command{awk} program, 4233followed by the number of the field you want. Thus, @code{$1} 4234refers to the first field, @code{$2} to the second, and so on. 4235(Unlike the Unix shells, the field numbers are not limited to single digits. 4236@code{$127} is the one hundred twenty-seventh field in the record.) 
4237For example, suppose the following is a line of input: 4238 4239@example 4240This seems like a pretty nice example. 4241@end example 4242 4243@noindent 4244Here the first field, or @code{$1}, is @samp{This}, the second field, or 4245@code{$2}, is @samp{seems}, and so on. Note that the last field, 4246@code{$7}, is @samp{example.}. Because there is no space between the 4247@samp{e} and the @samp{.}, the period is considered part of the seventh 4248field. 4249 4250@cindex @code{NF} variable 4251@cindex fields, number of 4252@code{NF} is a built-in variable whose value is the number of fields 4253in the current record. @command{awk} automatically updates the value 4254of @code{NF} each time it reads a record. No matter how many fields 4255there are, the last field in a record can be represented by @code{$NF}. 4256So, @code{$NF} is the same as @code{$7}, which is @samp{example.}. 4257If you try to reference a field beyond the last 4258one (such as @code{$8} when the record has only seven fields), you get 4259the empty string. (If used in a numeric operation, you get zero.) 4260 4261The use of @code{$0}, which looks like a reference to the ``zero-th'' field, is 4262a special case: it represents the whole input record 4263when you are not interested in specific fields. 4264Here are some more examples: 4265 4266@example 4267$ awk '$1 ~ /foo/ @{ print $0 @}' BBS-list 4268@print{} fooey 555-1234 2400/1200/300 B 4269@print{} foot 555-6699 1200/300 B 4270@print{} macfoo 555-6480 1200/300 A 4271@print{} sabafoo 555-2127 1200/300 C 4272@end example 4273 4274@noindent 4275This example prints each record in the file @file{BBS-list} whose first 4276field contains the string @samp{foo}. The operator @samp{~} is called a 4277@dfn{matching operator} 4278(@pxref{Regexp Usage}); 4279it tests whether a string (here, the field @code{$1}) matches a given regular 4280expression. 
4281 4282By contrast, the following example 4283looks for @samp{foo} in @emph{the entire record} and prints the first 4284field and the last field for each matching input record: 4285 4286@example 4287$ awk '/foo/ @{ print $1, $NF @}' BBS-list 4288@print{} fooey B 4289@print{} foot B 4290@print{} macfoo A 4291@print{} sabafoo C 4292@end example 4293@c ENDOFRANGE fiex 4294 4295@node Nonconstant Fields 4296@section Nonconstant Field Numbers 4297@cindex fields, numbers 4298@cindex field numbers 4299 4300The number of a field does not need to be a constant. Any expression in 4301the @command{awk} language can be used after a @samp{$} to refer to a 4302field. The value of the expression specifies the field number. If the 4303value is a string, rather than a number, it is converted to a number. 4304Consider this example: 4305 4306@example 4307awk '@{ print $NR @}' 4308@end example 4309 4310@noindent 4311Recall that @code{NR} is the number of records read so far: one in the 4312first record, two in the second, etc. So this example prints the first 4313field of the first record, the second field of the second record, and so 4314on. For the twentieth record, field number 20 is printed; most likely, 4315the record has fewer than 20 fields, so this prints a blank line. 4316Here is another example of using expressions as field numbers: 4317 4318@example 4319awk '@{ print $(2*2) @}' BBS-list 4320@end example 4321 4322@command{awk} evaluates the expression @samp{(2*2)} and uses 4323its value as the number of the field to print. The @samp{*} sign 4324represents multiplication, so the expression @samp{2*2} evaluates to four. 4325The parentheses are used so that the multiplication is done before the 4326@samp{$} operation; they are necessary whenever there is a binary 4327operator in the field-number expression. This example, then, prints the 4328hours of operation (the fourth field) for every line of the file 4329@file{BBS-list}. 
(All of the @command{awk} operators are listed, in 4330order of decreasing precedence, in 4331@ref{Precedence}.) 4332 4333If the field number you compute is zero, you get the entire record. 4334Thus, @samp{$(2-2)} has the same value as @code{$0}. Negative field 4335numbers are not allowed; trying to reference one usually terminates 4336the program. (The POSIX standard does not define 4337what happens when you reference a negative field number. @command{gawk} 4338notices this and terminates your program. Other @command{awk} 4339implementations may behave differently.) 4340 4341As mentioned in @ref{Fields}, 4342@command{awk} stores the current record's number of fields in the built-in 4343variable @code{NF} (also @pxref{Built-in Variables}). The expression 4344@code{$NF} is not a special feature---it is the direct consequence of 4345evaluating @code{NF} and using its value as a field number. 4346 4347@node Changing Fields 4348@section Changing the Contents of a Field 4349 4350@c STARTOFRANGE ficon 4351@cindex fields, changing contents of 4352The contents of a field, as seen by @command{awk}, can be changed within an 4353@command{awk} program; this changes what @command{awk} perceives as the 4354current input record. (The actual input is untouched; @command{awk} @emph{never} 4355modifies the input file.) 4356Consider the following example and its output: 4357 4358@example 4359$ awk '@{ nboxes = $3 ; $3 = $3 - 10 4360> print nboxes, $3 @}' inventory-shipped 4361@print{} 25 15 4362@print{} 32 22 4363@print{} 24 14 4364@dots{} 4365@end example 4366 4367@noindent 4368The program first saves the original value of field three in the variable 4369@code{nboxes}. 4370The @samp{-} sign represents subtraction, so this program reassigns 4371field three, @code{$3}, as the original value of field three minus ten: 4372@samp{$3 - 10}. (@xref{Arithmetic Ops}.) 4373Then it prints the original and new values for field three. 
(Someone in the warehouse made a consistent mistake while inventorying
the red boxes.)

For this to work, the text in field @code{$3} must make sense
as a number; the string of characters must be converted to a number
for the computer to do arithmetic on it. The number resulting
from the subtraction is converted back to a string of characters that
then becomes field three.
@xref{Conversion}.

When the value of a field is changed (as perceived by @command{awk}), the
text of the input record is recalculated to contain the new field where
the old one was. In other words, @code{$0} changes to reflect the altered
field. Thus, this program
prints a copy of the input file, with 10 subtracted from the second
field of each line:

@example
$ awk '@{ $2 = $2 - 10; print $0 @}' inventory-shipped
@print{} Jan 3 25 15 115
@print{} Feb 5 32 24 226
@print{} Mar 5 24 34 228
@dots{}
@end example

It is also possible to assign contents to fields that are out
of range. For example:

@example
$ awk '@{ $6 = ($5 + $4 + $3 + $2)
> print $6 @}' inventory-shipped
@print{} 168
@print{} 297
@print{} 301
@dots{}
@end example

@cindex adding, fields
@cindex fields, adding
@noindent
We've just created @code{$6}, whose value is the sum of fields
@code{$2}, @code{$3}, @code{$4}, and @code{$5}. The @samp{+} sign
represents addition. For the file @file{inventory-shipped}, @code{$6}
represents the total number of parcels shipped for a particular month.

Creating a new field changes @command{awk}'s internal copy of the current
input record, which is the value of @code{$0}. Thus, if you do @samp{print $0}
after adding a field, the record printed includes the new field, with
the appropriate number of field separators between it and the previously
existing fields.
4424 4425@cindex @code{OFS} variable 4426@cindex output field separator, See @code{OFS} variable 4427@cindex field separators, See Also @code{OFS} 4428This recomputation affects and is affected by 4429@code{NF} (the number of fields; @pxref{Fields}). 4430For example, the value of @code{NF} is set to the number of the highest 4431field you create. 4432The exact format of @code{$0} is also affected by a feature that has not been discussed yet: 4433the @dfn{output field separator}, @code{OFS}, 4434used to separate the fields (@pxref{Output Separators}). 4435 4436Note, however, that merely @emph{referencing} an out-of-range field 4437does @emph{not} change the value of either @code{$0} or @code{NF}. 4438Referencing an out-of-range field only produces an empty string. For 4439example: 4440 4441@example 4442if ($(NF+1) != "") 4443 print "can't happen" 4444else 4445 print "everything is normal" 4446@end example 4447 4448@noindent 4449should print @samp{everything is normal}, because @code{NF+1} is certain 4450to be out of range. (@xref{If Statement}, 4451for more information about @command{awk}'s @code{if-else} statements. 4452@xref{Typing and Comparison}, 4453for more information about the @samp{!=} operator.) 4454 4455It is important to note that making an assignment to an existing field 4456changes the 4457value of @code{$0} but does not change the value of @code{NF}, 4458even when you assign the empty string to a field. For example: 4459 4460@example 4461$ echo a b c d | awk '@{ OFS = ":"; $2 = "" 4462> print $0; print NF @}' 4463@print{} a::c:d 4464@print{} 4 4465@end example 4466 4467@noindent 4468The field is still there; it just has an empty value, denoted by 4469the two colons between @samp{a} and @samp{c}. 
This example shows what happens if you create a new field:

@example
$ echo a b c d | awk '@{ OFS = ":"; $2 = ""; $6 = "new"
> print $0; print NF @}'
@print{} a::c:d::new
@print{} 6
@end example

@noindent
The intervening field, @code{$5}, is created with an empty value
(indicated by the second pair of adjacent colons),
and @code{NF} is updated with the value six.

@c FIXME: Verify that this is in POSIX
@cindex dark corner, @code{NF} variable, decrementing
@cindex @code{NF} variable, decrementing
Decrementing @code{NF} throws away the values of the fields
after the new value of @code{NF} and recomputes @code{$0}.
@value{DARKCORNER}
Here is an example:

@example
$ echo a b c d e f | awk '@{ print "NF =", NF;
> NF = 3; print $0 @}'
@print{} NF = 6
@print{} a b c
@end example

@c the comma before decrementing does NOT represent a tertiary entry
@cindex portability, @code{NF} variable, decrementing
@strong{Caution:} Some versions of @command{awk} don't
rebuild @code{$0} when @code{NF} is decremented. Caveat emptor.

Finally, there are times when it is convenient to force
@command{awk} to rebuild the entire record, using the current
value of the fields and @code{OFS}. To do this, use the
seemingly innocuous assignment:

@example
$1 = $1 # force record to be reconstituted
print $0 # or whatever else with $0
@end example

@noindent
This forces @command{awk} to rebuild the record. It does help
to add a comment, as we've shown here.

There is a flip side to the relationship between @code{$0} and
the fields. Any assignment to @code{$0} causes the record to be
reparsed into fields using the @emph{current} value of @code{FS}.
This also applies to any built-in function that updates @code{$0},
such as @code{sub} and @code{gsub}
(@pxref{String Functions}).
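
The following sketch (our own example) shows the reparsing at work.
In the first pipeline, the record is initially split using the default
value of @code{FS}; after @code{FS} is changed, assigning @code{$0} to
itself splits the same text again. In the second pipeline, @code{gsub}
modifies @code{$0}, so the fields are recomputed as well:

@example
$ echo 'a:b:c' | awk '@{ print NF; FS = ":"; $0 = $0
> print NF; print $2 @}'
@print{} 1
@print{} 3
@print{} b
$ echo 'a b c' | awk '@{ gsub(/ /, ""); print NF ": " $0 @}'
@print{} 1: abc
@end example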
4524@c ENDOFRANGE ficon 4525 4526@node Field Separators 4527@section Specifying How Fields Are Separated 4528 4529@menu 4530* Regexp Field Splitting:: Using regexps as the field separator. 4531* Single Character Fields:: Making each character a separate field. 4532* Command Line Field Separator:: Setting @code{FS} from the command-line. 4533* Field Splitting Summary:: Some final points and a summary table. 4534@end menu 4535 4536@cindex @code{FS} variable 4537@cindex fields, separating 4538@c STARTOFRANGE fisepr 4539@cindex field separators 4540@c STARTOFRANGE fisepg 4541@cindex fields, separating 4542The @dfn{field separator}, which is either a single character or a regular 4543expression, controls the way @command{awk} splits an input record into fields. 4544@command{awk} scans the input record for character sequences that 4545match the separator; the fields themselves are the text between the matches. 4546 4547In the examples that follow, we use the bullet symbol (@bullet{}) to 4548represent spaces in the output. 4549If the field separator is @samp{oo}, then the following line: 4550 4551@example 4552moo goo gai pan 4553@end example 4554 4555@noindent 4556is split into three fields: @samp{m}, @samp{@bullet{}g}, and 4557@samp{@bullet{}gai@bullet{}pan}. 4558Note the leading spaces in the values of the second and third fields. 4559 4560@cindex troubleshooting, @command{awk} uses @code{FS} not @code{IFS} 4561The field separator is represented by the built-in variable @code{FS}. 4562Shell programmers take note: @command{awk} does @emph{not} use the 4563name @code{IFS} that is used by the POSIX-compliant shells (such as 4564the Unix Bourne shell, @command{sh}, or @command{bash}). 4565 4566@cindex @code{FS} variable, changing value of 4567The value of @code{FS} can be changed in the @command{awk} program with the 4568assignment operator, @samp{=} (@pxref{Assignment Ops}). 
4569Often the right time to do this is at the beginning of execution 4570before any input has been processed, so that the very first record 4571is read with the proper separator. To do this, use the special 4572@code{BEGIN} pattern 4573(@pxref{BEGIN/END}). 4574For example, here we set the value of @code{FS} to the string 4575@code{","}: 4576 4577@example 4578awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}' 4579@end example 4580 4581@cindex @code{BEGIN} pattern 4582@noindent 4583Given the input line: 4584 4585@example 4586John Q. Smith, 29 Oak St., Walamazoo, MI 42139 4587@end example 4588 4589@noindent 4590this @command{awk} program extracts and prints the string 4591@samp{@bullet{}29@bullet{}Oak@bullet{}St.}. 4592 4593@cindex field separators, choice of 4594@cindex regular expressions as field separators 4595@cindex field separators, regular expressions as 4596Sometimes the input data contains separator characters that don't 4597separate fields the way you thought they would. For instance, the 4598person's name in the example we just used might have a title or 4599suffix attached, such as: 4600 4601@example 4602John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139 4603@end example 4604 4605@noindent 4606The same program would extract @samp{@bullet{}LXIX}, instead of 4607@samp{@bullet{}29@bullet{}Oak@bullet{}St.}. 4608If you were expecting the program to print the 4609address, you would be surprised. The moral is to choose your data layout and 4610separator characters carefully to prevent such problems. 4611(If the data is not in a form that is easy to process, perhaps you 4612can massage it first with a separate @command{awk} program.) 4613 4614@cindex newlines, as field separators 4615@cindex whitespace, as field separators 4616Fields are normally separated by whitespace sequences 4617(spaces, tabs, and newlines), not by single spaces. Two spaces in a row do not 4618delimit an empty field. 
The default value of the field separator @code{FS} 4619is a string containing a single space, @w{@code{" "}}. If @command{awk} 4620interpreted this value in the usual way, each space character would separate 4621fields, so two spaces in a row would make an empty field between them. 4622The reason this does not happen is that a single space as the value of 4623@code{FS} is a special case---it is taken to specify the default manner 4624of delimiting fields. 4625 4626If @code{FS} is any other single character, such as @code{","}, then 4627each occurrence of that character separates two fields. Two consecutive 4628occurrences delimit an empty field. If the character occurs at the 4629beginning or the end of the line, that too delimits an empty field. The 4630space character is the only single character that does not follow these 4631rules. 4632 4633@node Regexp Field Splitting 4634@subsection Using Regular Expressions to Separate Fields 4635 4636@c STARTOFRANGE regexpfs 4637@cindex regular expressions, as field separators 4638@c STARTOFRANGE fsregexp 4639@cindex field separators, regular expressions as 4640The previous @value{SUBSECTION} 4641discussed the use of single characters or simple strings as the 4642value of @code{FS}. 4643More generally, the value of @code{FS} may be a string containing any 4644regular expression. In this case, each match in the record for the regular 4645expression separates fields. For example, the assignment: 4646 4647@example 4648FS = ", \t" 4649@end example 4650 4651@noindent 4652makes every area of an input line that consists of a comma followed by a 4653space and a TAB into a field separator. 4654@ifinfo 4655(@samp{\t} 4656is an @dfn{escape sequence} that stands for a TAB; 4657@pxref{Escape Sequences}, 4658for the complete list of similar escape sequences.) 4659@end ifinfo 4660 4661For a less trivial example of a regular expression, try using 4662single spaces to separate fields the way single commas are used. 
4663@code{FS} can be set to @w{@code{"[@ ]"}} (left bracket, space, right 4664bracket). This regular expression matches a single space and nothing else 4665(@pxref{Regexp}). 4666 4667There is an important difference between the two cases of @samp{FS = @w{" "}} 4668(a single space) and @samp{FS = @w{"[ \t\n]+"}} 4669(a regular expression matching one or more spaces, tabs, or newlines). 4670For both values of @code{FS}, fields are separated by @dfn{runs} 4671(multiple adjacent occurrences) of spaces, tabs, 4672and/or newlines. However, when the value of @code{FS} is @w{@code{" "}}, 4673@command{awk} first strips leading and trailing whitespace from 4674the record and then decides where the fields are. 4675For example, the following pipeline prints @samp{b}: 4676 4677@example 4678$ echo ' a b c d ' | awk '@{ print $2 @}' 4679@print{} b 4680@end example 4681 4682@noindent 4683However, this pipeline prints @samp{a} (note the extra spaces around 4684each letter): 4685 4686@example 4687$ echo ' a b c d ' | awk 'BEGIN @{ FS = "[ \t\n]+" @} 4688> @{ print $2 @}' 4689@print{} a 4690@end example 4691 4692@noindent 4693@cindex null strings 4694@cindex strings, null 4695@cindex empty strings, See null strings 4696In this case, the first field is @dfn{null} or empty. 4697 4698The stripping of leading and trailing whitespace also comes into 4699play whenever @code{$0} is recomputed. For instance, study this pipeline: 4700 4701@example 4702$ echo ' a b c d' | awk '@{ print; $2 = $2; print @}' 4703@print{} a b c d 4704@print{} a b c d 4705@end example 4706 4707@noindent 4708The first @code{print} statement prints the record as it was read, 4709with leading whitespace intact. The assignment to @code{$2} rebuilds 4710@code{$0} by concatenating @code{$1} through @code{$NF} together, 4711separated by the value of @code{OFS}. Because the leading whitespace 4712was ignored when finding @code{$1}, it is not part of the new @code{$0}. 
4713Finally, the last @code{print} statement prints the new @code{$0}. 4714@c ENDOFRANGE regexpfs 4715@c ENDOFRANGE fsregexp 4716 4717@node Single Character Fields 4718@subsection Making Each Character a Separate Field 4719 4720@cindex differences in @command{awk} and @command{gawk}, single-character fields 4721@cindex single-character fields 4722@cindex fields, single-character 4723There are times when you may want to examine each character 4724of a record separately. This can be done in @command{gawk} by 4725simply assigning the null string (@code{""}) to @code{FS}. In this case, 4726each individual character in the record becomes a separate field. 4727For example: 4728 4729@example 4730$ echo a b | gawk 'BEGIN @{ FS = "" @} 4731> @{ 4732> for (i = 1; i <= NF; i = i + 1) 4733> print "Field", i, "is", $i 4734> @}' 4735@print{} Field 1 is a 4736@print{} Field 2 is 4737@print{} Field 3 is b 4738@end example 4739 4740@cindex dark corner, @code{FS} as null string 4741@cindex FS variable, as null string 4742Traditionally, the behavior of @code{FS} equal to @code{""} was not defined. 4743In this case, most versions of Unix @command{awk} simply treat the entire record 4744as only having one field. 4745@value{DARKCORNER} 4746In compatibility mode 4747(@pxref{Options}), 4748if @code{FS} is the null string, then @command{gawk} also 4749behaves this way. 4750 4751@node Command Line Field Separator 4752@subsection Setting @code{FS} from the Command Line 4753@cindex @code{-F} option 4754@cindex options, command-line 4755@cindex command line, options 4756@cindex field separators, on command line 4757@c The comma before "setting" does NOT represent a tertiary 4758@cindex command line, @code{FS} on, setting 4759@cindex @code{FS} variable, setting from command line 4760 4761@code{FS} can be set on the command line. Use the @option{-F} option to 4762do so. 
For example: 4763 4764@example 4765awk -F, '@var{program}' @var{input-files} 4766@end example 4767 4768@noindent 4769sets @code{FS} to the @samp{,} character. Notice that the option uses 4770an uppercase @samp{F} instead of a lowercase @samp{f}. The latter 4771option (@option{-f}) specifies a file 4772containing an @command{awk} program. Case is significant in command-line 4773options: 4774the @option{-F} and @option{-f} options have nothing to do with each other. 4775You can use both options at the same time to set the @code{FS} variable 4776@emph{and} get an @command{awk} program from a file. 4777 4778The value used for the argument to @option{-F} is processed in exactly the 4779same way as assignments to the built-in variable @code{FS}. 4780Any special characters in the field separator must be escaped 4781appropriately. For example, to use a @samp{\} as the field separator 4782on the command line, you would have to type: 4783 4784@example 4785# same as FS = "\\" 4786awk -F\\\\ '@dots{}' files @dots{} 4787@end example 4788 4789@noindent 4790@cindex @code{\} (backslash), as field separators 4791@cindex backslash (@code{\}), as field separators 4792Because @samp{\} is used for quoting in the shell, @command{awk} sees 4793@samp{-F\\}. Then @command{awk} processes the @samp{\\} for escape 4794characters (@pxref{Escape Sequences}), finally yielding 4795a single @samp{\} to use for the field separator. 4796 4797@c @cindex historical features 4798As a special case, in compatibility mode 4799(@pxref{Options}), 4800if the argument to @option{-F} is @samp{t}, then @code{FS} is set to 4801the TAB character. If you type @samp{-F\t} at the 4802shell, without any quotes, the @samp{\} gets deleted, so @command{awk} 4803figures that you really want your fields to be separated with tabs and 4804not @samp{t}s. Use @samp{-v FS="t"} or @samp{-F"[t]"} on the command line 4805if you really do want to separate your fields with @samp{t}s. 
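
As a short sketch of these quoting rules (our own example, assuming a
@command{printf} utility that interprets @samp{\t}), quoting the
argument to @option{-F} keeps the shell from removing the backslash, so
that @command{awk} itself turns @samp{\t} into a TAB. The bracket
expression in the second pipeline makes @samp{t} an ordinary separator
character:

@example
$ printf 'a\tb c\n' | awk -F'\t' '@{ print $2 @}'
@print{} b c
$ echo 'atb' | awk -F'[t]' '@{ print $2 @}'
@print{} b
@end example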

For example, let's use an @command{awk} program file called @file{baud.awk}
that contains the pattern @code{/300/} and the action @samp{print $1}:

@example
/300/   @{ print $1 @}
@end example

Let's also set @code{FS} to be the @samp{-} character and run the
program on the file @file{BBS-list}. The following command prints a
list of the names of the bulletin boards that operate at 300 baud and
the first three digits of their phone numbers:

@c tweaked to make the tex output look better in @smallbook
@example
$ awk -F- -f baud.awk BBS-list
@print{} aardvark 555
@print{} alpo
@print{} barfly 555
@print{} bites 555
@print{} camelot 555
@print{} core 555
@print{} fooey 555
@print{} foot 555
@print{} macfoo 555
@print{} sdace 555
@print{} sabafoo 555
@end example

@noindent
Note the second line of output. The second line
in the original file looked like this:

@example
alpo-net 555-3412 2400/1200/300 A
@end example

The @samp{-} as part of the system's name was used as the field
separator, instead of the @samp{-} in the phone number that was
originally intended. This demonstrates why you have to be careful in
choosing your field and record separators.

@c The comma after "password files" does NOT start a tertiary
@cindex Unix @command{awk}, password files, field separators and
Perhaps the most common use of a single character as the field
separator occurs when processing the Unix system password file.
On many Unix systems, each user has a separate entry in the system password
file, one line per user. The information in these lines is separated
by colons. The first field is the user's logon name and the second is
the user's (encrypted or shadow) password.
A password file entry might look
like this:

@cindex Robbins, Arnold
@example
arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/bash
@end example

The following program searches the system password file and prints
the entries for users who have no password:

@example
awk -F: '$2 == ""' /etc/passwd
@end example

@node Field Splitting Summary
@subsection Field-Splitting Summary

It is important to remember that when you assign a string constant
as the value of @code{FS}, it undergoes normal @command{awk} string
processing. For example, with Unix @command{awk} and @command{gawk},
the assignment @samp{FS = "\.."} assigns the character string @code{".."}
to @code{FS} (the backslash is stripped). This creates a regexp meaning
``fields are separated by occurrences of any two characters.''
If instead you want fields to be separated by a literal period followed
by any single character, use @samp{FS = "\\.."}.

The following table summarizes how fields are split, based on the value
of @code{FS} (@samp{==} means ``is equal to''):

@table @code
@item FS == " "
Fields are separated by runs of whitespace. Leading and trailing
whitespace are ignored. This is the default.

@item FS == @var{any other single character}
Fields are separated by each occurrence of the character. Multiple
successive occurrences delimit empty fields, as do leading and
trailing occurrences.
The character can even be a regexp metacharacter; it does not need
to be escaped.

@item FS == @var{regexp}
Fields are separated by occurrences of characters that match @var{regexp}.
Leading and trailing matches of @var{regexp} delimit empty fields.

@item FS == ""
Each individual character in the record becomes a separate field.
(This is a @command{gawk} extension; it is not specified by the
POSIX standard.)
@end table

@c fakenode --- for prepinfo
@subheading Advanced Notes: Changing @code{FS} Does Not Affect the Fields

@cindex POSIX @command{awk}, field separators and
@cindex field separators, POSIX and
According to the POSIX standard, @command{awk} is supposed to behave
as if each record is split into fields at the time it is read.
In particular, this means that if you change the value of @code{FS}
after a record is read, the value of the fields (i.e., how they were split)
should reflect the old value of @code{FS}, not the new one.

@cindex dark corner, field separators
@cindex @command{sed} utility
@cindex stream editors
However, many implementations of @command{awk} do not work this way. Instead,
they defer splitting the fields until a field is actually
referenced. The fields are split
using the @emph{current} value of @code{FS}!
@value{DARKCORNER}
This behavior can be difficult
to diagnose. The following example illustrates the difference
between the two methods.
(The @command{sed}@footnote{The @command{sed} utility is a ``stream editor.''
Its behavior is also defined by the POSIX standard.}
command prints just the first line of @file{/etc/passwd}.)

@example
sed 1q /etc/passwd | awk '@{ FS = ":" ; print $1 @}'
@end example

@noindent
which usually prints:

@example
root
@end example

@noindent
on an incorrect implementation of @command{awk}, while @command{gawk}
prints something like:

@example
root:nSijPlPhZZwgE:0:0:Root:/:
@end example

@c fakenode --- for prepinfo
@subheading Advanced Notes: @code{FS} and @code{IGNORECASE}

The @code{IGNORECASE} variable
(@pxref{User-modified})
affects field splitting @emph{only} when the value of @code{FS} is a regexp.
It has no effect when @code{FS} is a single character, even if
that character is a letter.
Thus, in the following code:

@example
FS = "c"
IGNORECASE = 1
$0 = "aCa"
print $1
@end example

@noindent
the output is @samp{aCa}. If you really want to split fields on an
alphabetic character while ignoring case, use a regexp that will
do it for you, e.g., @samp{FS = "[c]"}. In this case, @code{IGNORECASE}
will take effect.

@c ENDOFRANGE fisepr
@c ENDOFRANGE fisepg

@node Constant Size
@section Reading Fixed-Width Data

@ifnotinfo
@strong{Note:} This @value{SECTION} discusses an advanced
feature of @command{gawk}. If you are a novice @command{awk} user,
you might want to skip it on the first reading.
@end ifnotinfo

@ifinfo
(This @value{SECTION} discusses an advanced feature of @command{gawk}.
If you are a novice @command{awk} user, you might want to skip it on
the first reading.)
@end ifinfo

@cindex data, fixed-width
@cindex fixed-width data
@cindex advanced features, fixed-width data
@command{gawk} @value{PVERSION} 2.13 introduced a facility for dealing with
fixed-width fields with no distinctive field separator. For example,
data of this nature arises in the input for old Fortran programs where
numbers are run together, or in the output of programs that did not
anticipate the use of their output as input for other programs.

An example of the latter is a table where all the columns are lined up by
the use of a variable number of spaces and @emph{empty fields are just
spaces}. Clearly, @command{awk}'s normal field splitting based on @code{FS}
does not work well in this case. Although a portable @command{awk} program
can use a series of @code{substr} calls on @code{$0}
(@pxref{String Functions}),
this is awkward and inefficient for a large number of fields.
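
To make the portable approach concrete, here is a minimal sketch of the
@code{substr} technique just mentioned. The three-column layout (widths
5, 3, and 4) and the sample record are assumptions chosen only for
illustration:

@example
$ echo 'abcdeXYZ1234' | awk '@{
>     print substr($0, 1, 5)     # columns 1-5
>     print substr($0, 6, 3)     # columns 6-8
>     print substr($0, 9, 4)     # columns 9-12
> @}'
@print{} abcde
@print{} XYZ
@print{} 1234
@end example

@noindent
Each column needs its own call with a hand-computed starting position,
which is why this gets tedious as the number of fields grows.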

@c comma before specifying is part of tertiary
@cindex troubleshooting, fatal errors, field widths, specifying
@cindex @command{w} utility
@cindex @code{FIELDWIDTHS} variable
The splitting of an input record into fixed-width fields is specified by
assigning a string containing space-separated numbers to the built-in
variable @code{FIELDWIDTHS}. Each number specifies the width of the field,
@emph{including} columns between fields. If you want to ignore the columns
between fields, you can specify the width as a separate field that is
subsequently ignored.
It is a fatal error to supply a field width that is not a positive number.
The following data is the output of the Unix @command{w} utility. It is useful
to illustrate the use of @code{FIELDWIDTHS}:

@example
@group
 10:06pm  up 21 days, 14:04,  23 users
User     tty       login@  idle   JCPU   PCPU  what
hzuo     ttyV0     8:58pm            9      5  vi p24.tex
hzang    ttyV3     6:37pm    50                -csh
eklye    ttyV5     9:53pm            7      1  em thes.tex
dportein ttyV6     8:17pm  1:47                -csh
gierd    ttyD3    10:00pm     1                elm
dave     ttyD4     9:47pm            4      4  w
brent    ttyp0    26Jun91  4:46  26:46   4:41  bash
dave     ttyq4    26Jun9115days     46     46  wnewmail
@end group
@end example

The following program takes the above input, converts the idle time to
number of seconds, and prints out the first two fields and the calculated
idle time:

@strong{Note:}
This program uses a number of @command{awk} features that
haven't been introduced yet.

@example
BEGIN  @{ FIELDWIDTHS = "9 6 10 6 7 7 35" @}
NR > 2 @{
    idle = $4
    sub(/^ */, "", idle)   # strip leading spaces
    if (idle == "")
        idle = 0
    if (idle ~ /:/) @{
        split(idle, t, ":")
        idle = t[1] * 60 + t[2]
    @}
    if (idle ~ /days/)
        idle *= 24 * 60 * 60

    print $1, $2, idle
@}
@end example

Running the program on the data produces the following results:

@example
hzuo      ttyV0  0
hzang     ttyV3  50
eklye     ttyV5  0
dportein  ttyV6  107
gierd     ttyD3  1
dave      ttyD4  0
brent     ttyp0  286
dave      ttyq4  1296000
@end example

Another (possibly more practical) example of fixed-width input data
is the input from a deck of balloting cards. In some parts of
the United States, voters mark their choices by punching holes in computer
cards. These cards are then processed to count the votes for any particular
candidate or on any particular issue. Because a voter may choose not to
vote on some issue, any column on the card may be empty. An @command{awk}
program for processing such data could use the @code{FIELDWIDTHS} feature
to simplify reading the data. (Of course, getting @command{gawk} to run on
a system with card readers is another story!)

@ignore
Exercise: Write a ballot card reading program
@end ignore

@cindex @command{gawk}, splitting fields and
Assigning a value to @code{FS} causes @command{gawk} to use
@code{FS} for field splitting again. Use @samp{FS = FS} to make this happen,
without having to know the current value of @code{FS}.
In order to tell which kind of field splitting is in effect,
use @code{PROCINFO["FS"]}
(@pxref{Auto-set}).
The value is @code{"FS"} if regular field splitting is being used,
or it is @code{"FIELDWIDTHS"} if fixed-width field splitting is being used:

@example
if (PROCINFO["FS"] == "FS")
    @var{regular field splitting} @dots{}
else
    @var{fixed-width field splitting} @dots{}
@end example

This information is useful when writing a function
that needs to temporarily change @code{FS} or @code{FIELDWIDTHS},
read some records, and then restore the original settings
(@pxref{Passwd Functions},
for an example of such a function).

@node Multiple Line
@section Multiple-Line Records

@c STARTOFRANGE recm
@cindex records, multiline
@c STARTOFRANGE imr
@cindex input, multiline records
@c STARTOFRANGE frm
@cindex files, reading, multiline records
@cindex input, files, See input files
In some databases, a single line cannot conveniently hold all the
information in one entry. In such cases, you can use multiline
records. The first step in doing this is to choose your data format.

@cindex record separators, with multiline records
One technique is to use an unusual character or string to separate
records. For example, you could use the formfeed character (written
@samp{\f} in @command{awk}, as in C) to separate them, making each record
a page of the file. To do this, just set the variable @code{RS} to
@code{"\f"} (a string containing the formfeed character). Any
other character could equally well be used, as long as it won't be part
of the data in a record.

@cindex @code{RS} variable, multiline records and
Another technique is to have blank lines separate records. By a special
dispensation, an empty string as the value of @code{RS} indicates that
records are separated by one or more blank lines. When @code{RS} is set
to the empty string, each record always ends at the first blank line
encountered.
The next record doesn't start until the first nonblank
line that follows. No matter how many blank lines appear in a row, they
all act as one record separator.
(Blank lines must be completely empty; lines that contain only
whitespace do not count.)

@cindex leftmost longest match
@cindex matching, leftmost longest
You can achieve the same effect as @samp{RS = ""} by assigning the
string @code{"\n\n+"} to @code{RS}. This regexp matches the newline
at the end of the record and one or more blank lines after the record.
In addition, a regular expression always matches the longest possible
sequence when there is a choice
(@pxref{Leftmost Longest}).
So the next record doesn't start until
the first nonblank line that follows---no matter how many blank lines
appear in a row, they are considered one record separator.

@cindex dark corner, multiline records
There is an important difference between @samp{RS = ""} and
@samp{RS = "\n\n+"}. In the first case, leading newlines in the input
@value{DF} are ignored, and if a file ends without extra blank lines
after the last record, the final newline is removed from the record.
In the second case, this special processing is not done.
@value{DARKCORNER}

@cindex field separators, in multiline records
Now that the input is separated into records, the second step is to
separate the fields in the record. One way to do this is to divide each
of the lines into fields in the normal manner. This happens by default
as the result of a special feature. When @code{RS} is set to the empty
string, @emph{and} @code{FS} is set to a single character,
the newline character @emph{always} acts as a field separator.
This is in addition to whatever field separations result from
@code{FS}.@footnote{When @code{FS} is the null string (@code{""})
or a regexp, this special feature of @code{RS} does not apply.
It does apply to the default field separator of a single space:
@samp{FS = " "}.}

The original motivation for this special exception was probably to provide
useful behavior in the default case (i.e., @code{FS} is equal
to @w{@code{" "}}). This feature can be a problem if you really don't
want the newline character to separate fields, because there is no way to
prevent it. However, you can work around this by using the @code{split}
function to break up the record manually
(@pxref{String Functions}).
If you have a single-character field separator, you can work around
the special feature in a different way, by making @code{FS} into a
regexp for that single character. For example, if the field
separator is a percent character, instead of
@samp{FS = "%"}, use @samp{FS = "[%]"}.

Another way to separate fields is to
put each field on a separate line: to do this, just set the
variable @code{FS} to the string @code{"\n"}. (This single-character
separator matches a single newline.)
A practical example of a @value{DF} organized this way might be a mailing
list, where each entry is separated by blank lines. Consider a mailing
list in a file named @file{addresses}, which looks like this:

@example
Jane Doe
123 Main Street
Anywhere, SE 12345-6789

John Smith
456 Tree-lined Avenue
Smallville, MW 98765-4321
@dots{}
@end example

@noindent
A simple program to process this file is as follows:

@example
# addrs.awk --- simple mailing list program

# Records are separated by blank lines.
# Each line is one field.
BEGIN @{ RS = "" ; FS = "\n" @}

@{
      print "Name is:", $1
      print "Address is:", $2
      print "City and State are:", $3
      print ""
@}
@end example

Running the program produces the following output:

@example
$ awk -f addrs.awk addresses
@print{} Name is: Jane Doe
@print{} Address is: 123 Main Street
@print{} City and State are: Anywhere, SE 12345-6789
@print{}
@print{} Name is: John Smith
@print{} Address is: 456 Tree-lined Avenue
@print{} City and State are: Smallville, MW 98765-4321
@print{}
@dots{}
@end example

@xref{Labels Program}, for a more realistic
program that deals with address lists.
The following
table
summarizes how records are split, based on the
value of
@ifinfo
@code{RS}.
(@samp{==} means ``is equal to.'')
@end ifinfo
@ifnotinfo
@code{RS}:
@end ifnotinfo

@table @code
@item RS == "\n"
Records are separated by the newline character (@samp{\n}). In effect,
every line in the @value{DF} is a separate record, including blank lines.
This is the default.

@item RS == @var{any single character}
Records are separated by each occurrence of the character. Multiple
successive occurrences delimit empty records.

@item RS == ""
Records are separated by runs of blank lines. The newline character
always serves as a field separator, in addition to whatever value
@code{FS} may have. Leading and trailing newlines in a file are ignored.

@item RS == @var{regexp}
Records are separated by occurrences of characters that match @var{regexp}.
Leading and trailing matches of @var{regexp} delimit empty records.
(This is a @command{gawk} extension; it is not specified by the
POSIX standard.)
@end table

@cindex @code{RT} variable
In all cases, @command{gawk} sets @code{RT} to the input text that matched the
value specified by @code{RS}.
@c ENDOFRANGE recm
@c ENDOFRANGE imr
@c ENDOFRANGE frm

@node Getline
@section Explicit Input with @code{getline}

@c STARTOFRANGE getl
@cindex @code{getline} command, explicit input with
@cindex input, explicit
So far we have been getting our input data from @command{awk}'s main
input stream---either the standard input (usually your terminal, sometimes
the output from another program) or from the
files specified on the command line. The @command{awk} language has a
special built-in command called @code{getline} that
can be used to read input under your explicit control.

The @code{getline} command is used in several different ways and should
@emph{not} be used by beginners.
The examples that follow the explanation of the @code{getline} command
include material that has not been covered yet. Therefore, come back
and study the @code{getline} command @emph{after} you have reviewed the
rest of this @value{DOCUMENT} and have a good knowledge of how @command{awk} works.

@cindex @code{ERRNO} variable
@cindex differences in @command{awk} and @command{gawk}, @code{getline} command
@cindex @code{getline} command, return values
The @code{getline} command returns one if it finds a record and zero if
it encounters the end of the file. If there is some error in getting
a record, such as a file that cannot be opened, then @code{getline}
returns @minus{}1. In this case, @command{gawk} sets the variable
@code{ERRNO} to a string describing the error that occurred.

In the following examples, @var{command} stands for a string value that
represents a shell command.

@menu
* Plain Getline::               Using @code{getline} with no arguments.
* Getline/Variable::            Using @code{getline} into a variable.
* Getline/File::                Using @code{getline} from a file.
* Getline/Variable/File::       Using @code{getline} into a variable from a
                                file.
* Getline/Pipe::                Using @code{getline} from a pipe.
* Getline/Variable/Pipe::       Using @code{getline} into a variable from a
                                pipe.
* Getline/Coprocess::           Using @code{getline} from a coprocess.
* Getline/Variable/Coprocess::  Using @code{getline} into a variable from a
                                coprocess.
* Getline Notes::               Important things to know about @code{getline}.
* Getline Summary::             Summary of @code{getline} Variants.
@end menu

@node Plain Getline
@subsection Using @code{getline} with No Arguments

The @code{getline} command can be used without arguments to read input
from the current input file. All it does in this case is read the next
input record and split it up into fields. This is useful if you've
finished processing the current record, but want to do some special
processing on the next record @emph{right now}. For example:

@example
@{
     if ((t = index($0, "/*")) != 0) @{
          # value of `tmp' will be "" if t is 1
          tmp = substr($0, 1, t - 1)
          u = index(substr($0, t + 2), "*/")
          while (u == 0) @{
               if (getline <= 0) @{
                    m = "unexpected EOF or error"
                    m = (m ": " ERRNO)
                    print m > "/dev/stderr"
                    exit
               @}
               t = -1
               u = index($0, "*/")
          @}
          # u is relative to position t + 2 when the comment ends
          # on the line where it began; after getline, t is -1, so
          # t + u + 3 is simply u + 2.  The substr expression will
          # be "" if */ occurred at end of line.
          $0 = tmp substr($0, t + u + 3)
     @}
     print $0
@}
@end example

This @command{awk} program deletes all C-style comments (@samp{/* @dots{}
*/}) from the input. By replacing the @samp{print $0} with other
statements, you could perform more complicated processing on the
decommented input, such as searching for matches of a regular
expression.
(This program has a subtle problem---it does not work if one
comment ends and another begins on the same line.)

@ignore
Exercise,
write a program that does handle multiple comments on the line.
@end ignore

This form of the @code{getline} command sets @code{NF},
@code{NR}, @code{FNR}, and the value of @code{$0}.

@strong{Note:} The new value of @code{$0} is used to test
the patterns of any subsequent rules. The original value
of @code{$0} that triggered the rule that executed @code{getline}
is lost.
By contrast, the @code{next} statement reads a new record
but immediately begins processing it normally, starting with the first
rule in the program. @xref{Next Statement}.

@node Getline/Variable
@subsection Using @code{getline} into a Variable
@c comma before using is NOT for tertiary
@cindex variables, @code{getline} command into, using

You can use @samp{getline @var{var}} to read the next record from
@command{awk}'s input into the variable @var{var}. No other processing is
done.
For example, suppose the next line is a comment or a special string,
and you want to read it without triggering
any rules. This form of @code{getline} allows you to read that line
and store it in a variable so that the main
read-a-line-and-check-each-rule loop of @command{awk} never sees it.
The following example swaps every two lines of input:

@example
@{
     if ((getline tmp) > 0) @{
          print tmp
          print $0
     @} else
          print $0
@}
@end example

@noindent
It takes the following list:

@example
wan
tew
free
phore
@end example

@noindent
and produces these results:

@example
tew
wan
phore
free
@end example

The @code{getline} command used in this way sets only the variables
@code{NR} and @code{FNR} (and of course, @var{var}).
The record is not
split into fields, so the values of the fields (including @code{$0}) and
the value of @code{NF} do not change.

@node Getline/File
@subsection Using @code{getline} from a File

@cindex input redirection
@cindex redirection of input
@cindex @code{<} (left angle bracket), @code{<} operator (I/O)
@cindex left angle bracket (@code{<}), @code{<} operator (I/O)
@cindex operators, input/output
Use @samp{getline < @var{file}} to read the next record from @var{file}.
Here @var{file} is a string-valued expression that
specifies the @value{FN}. @samp{< @var{file}} is called a @dfn{redirection}
because it directs input to come from a different place.
For example, the following
program reads its input record from the file @file{secondary.input} when it
encounters a first field with a value equal to 10 in the current input
file:

@example
@{
    if ($1 == 10) @{
         getline < "secondary.input"
         print
    @} else
         print
@}
@end example

Because the main input stream is not used, the values of @code{NR} and
@code{FNR} are not changed. However, the record it reads is split into fields in
the normal manner, so the values of @code{$0} and the other fields are
changed, resulting in a new value of @code{NF}.

@cindex POSIX @command{awk}, @code{<} operator and
@c Thanks to Paul Eggert for initial wording here
According to POSIX, @samp{getline < @var{expression}} is ambiguous if
@var{expression} contains unparenthesized operators other than
@samp{$}; for example, @samp{getline < dir "/" file} is ambiguous
because the concatenation operator is not parenthesized. You should
write it as @samp{getline < (dir "/" file)} if you want your program
to be portable to other @command{awk} implementations.
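
A minimal sketch of the parenthesized form follows. The variables
@code{dir} and @code{file}, and the path they name, are assumptions
chosen only for illustration:

@example
BEGIN @{
    dir = "/etc"
    file = "passwd"
    # Parenthesize the concatenation so that the redirection
    # target is parsed the same way by every awk.
    while ((getline < (dir "/" file)) > 0)
        n++
    close(dir "/" file)
    print n + 0, "records read"
@}
@end example

@noindent
If the file cannot be opened, @code{getline} returns @minus{}1, the loop
body never runs, and the program reports zero records.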

@node Getline/Variable/File
@subsection Using @code{getline} into a Variable from a File
@c comma before using is NOT for tertiary
@cindex variables, @code{getline} command into, using

Use @samp{getline @var{var} < @var{file}} to read input
from the file
@var{file}, and put it in the variable @var{var}. As above, @var{file}
is a string-valued expression that specifies the file from which to read.

In this version of @code{getline}, none of the built-in variables are
changed and the record is not split into fields. The only variable
changed is @var{var}.
For example, the following program copies all the input files to the
output, except for records that say @w{@samp{@@include @var{filename}}}.
Such a record is replaced by the contents of the file
@var{filename}:

@example
@{
     if (NF == 2 && $1 == "@@include") @{
          while ((getline line < $2) > 0)
               print line
          close($2)
     @} else
          print
@}
@end example

Note here how the name of the extra input file is not built into
the program; it is taken directly from the data, specifically from the second field on
the @samp{@@include} line.

@cindex @code{close} function
The @code{close} function is called to ensure that if two identical
@samp{@@include} lines appear in the input, the entire specified file is
included twice.
@xref{Close Files And Pipes}.

One deficiency of this program is that it does not process nested
@samp{@@include} statements
(i.e., @samp{@@include} statements in included files)
the way a true macro preprocessor would.
@xref{Igawk Program}, for a program
that does handle nested @samp{@@include} statements.
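
As a rough illustration of what handling nesting involves, the
following sketch moves the copying loop into a recursive function.
This is only one possible approach, not the one used by the program
referenced above. The function name @code{include} and its local
variables are hypothetical. The sketch makes no attempt to detect a
file that includes itself, directly or indirectly; such input makes
it misbehave:

@example
# include --- copy FILE to the output, expanding
#             any @@include lines found inside it
function include(file,    line, words, n)
@{
    while ((getline line < file) > 0) @{
        n = split(line, words)
        if (n == 2 && words[1] == "@@include")
            include(words[2])
        else
            print line
    @}
    close(file)
@}

@{
    if (NF == 2 && $1 == "@@include")
        include($2)
    else
        print
@}
@end example

@noindent
Because @samp{getline @var{var} < @var{file}} changes only @var{var},
each level of recursion can read its own file without disturbing
@code{$0} or the fields of the record being processed by the main rule.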

@node Getline/Pipe
@subsection Using @code{getline} from a Pipe

@cindex @code{|} (vertical bar), @code{|} operator (I/O)
@cindex vertical bar (@code{|}), @code{|} operator (I/O)
@cindex input pipeline
@cindex pipes, input
@cindex operators, input/output
The output of a command can also be piped into @code{getline}, using
@samp{@var{command} | getline}. In
this case, the string @var{command} is run as a shell command and its output
is piped into @command{awk} to be used as input. This form of @code{getline}
reads one record at a time from the pipe.
For example, the following program copies its input to its output, except for
lines that begin with @samp{@@execute}, which are replaced by the output
produced by running the rest of the line as a shell command:

@example
@{
     if ($1 == "@@execute") @{
          tmp = substr($0, 10)
          while ((tmp | getline) > 0)
               print
          close(tmp)
     @} else
          print
@}
@end example

@noindent
@cindex @code{close} function
The @code{close} function is called to ensure that if two identical
@samp{@@execute} lines appear in the input, the command is run for
each one.
@ifnottex
@xref{Close Files And Pipes}.
@end ifnottex
@c Exercise!!
@c This example is unrealistic, since you could just use system
Given the input:

@example
foo
bar
baz
@@execute who
bletch
@end example

@noindent
the program might produce:

@cindex Robbins, Bill
@cindex Robbins, Miriam
@cindex Robbins, Arnold
@example
foo
bar
baz
arnold ttyv0 Jul 13 14:22
miriam ttyp0 Jul 13 14:23 (murphy:0)
bill ttyp1 Jul 13 14:23 (murphy:0)
bletch
@end example

@noindent
Notice that this program ran the command @command{who} and printed its output.
(If you try this program yourself, you will of course get different results,
depending upon who is logged in on your system.)

This variation of @code{getline} splits the record into fields, sets the
value of @code{NF}, and recomputes the value of @code{$0}. The values of
@code{NR} and @code{FNR} are not changed.

@cindex POSIX @command{awk}, @code{|} I/O operator and
@c Thanks to Paul Eggert for initial wording here
According to POSIX, @samp{@var{expression} | getline} is ambiguous if
@var{expression} contains unparenthesized operators other than
@samp{$}---for example, @samp{@w{"echo "} "date" | getline} is ambiguous
because the concatenation operator is not parenthesized. You should
write it as @samp{(@w{"echo "} "date") | getline} if you want your program
to be portable to other @command{awk} implementations.

@node Getline/Variable/Pipe
@subsection Using @code{getline} into a Variable from a Pipe
@c comma before using is NOT for tertiary
@cindex variables, @code{getline} command into, using

When you use @samp{@var{command} | getline @var{var}}, the
output of @var{command} is sent through a pipe to
@code{getline} and into the variable @var{var}. For example, the
following program reads the current date and time into the variable
@code{current_time}, using the @command{date} utility, and then
prints it:

@example
BEGIN @{
     "date" | getline current_time
     close("date")
     print "Report printed on " current_time
@}
@end example

In this version of @code{getline}, none of the built-in variables are
changed and the record is not split into fields.
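
When a command produces more than one line, the same form can be used
in a loop. The following is a small sketch; the command string is a
made-up example chosen so that the result is predictable:

@example
BEGIN @{
    cmd = "echo one; echo two"     # hypothetical command
    while ((cmd | getline line) > 0)
        n++
    close(cmd)
    print n, "lines read"
@}
@end example

@noindent
This prints @samp{2 lines read}. Storing the command in a variable
makes it easy to pass exactly the same string to @code{close}.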

@ifinfo
@c Thanks to Paul Eggert for initial wording here
According to POSIX, @samp{@var{expression} | getline @var{var}} is ambiguous if
@var{expression} contains unparenthesized operators other than
@samp{$}; for example, @samp{@w{"echo "} "date" | getline @var{var}} is ambiguous
because the concatenation operator is not parenthesized. You should
write it as @samp{(@w{"echo "} "date") | getline @var{var}} if you want your
program to be portable to other @command{awk} implementations.
@end ifinfo

@node Getline/Coprocess
@subsection Using @code{getline} from a Coprocess
@cindex coprocesses, @code{getline} from
@c comma before using is NOT for tertiary
@cindex @code{getline} command, coprocesses, using from
@cindex @code{|} (vertical bar), @code{|&} operator (I/O)
@cindex vertical bar (@code{|}), @code{|&} operator (I/O)
@cindex operators, input/output
@cindex differences in @command{awk} and @command{gawk}, input/output operators

Input into @code{getline} from a pipe is a one-way operation.
The command that is started with @samp{@var{command} | getline} only
sends data @emph{to} your @command{awk} program.

On occasion, you might want to send data to another program
for processing and then read the results back.
@command{gawk} allows you to start a @dfn{coprocess}, with which two-way
communications are possible. This is done with the @samp{|&}
operator.
Typically, you write data to the coprocess first and then
read results back, as shown in the following:

@example
print "@var{some query}" |& "db_server"
"db_server" |& getline
@end example

@noindent
which sends a query to @command{db_server} and then reads the results.

The values of @code{NR} and
@code{FNR} are not changed,
because the main input stream is not used.
However, the record is split into fields in
the normal manner, thus changing the values of @code{$0}, of the other fields,
and of @code{NF}.

Coprocesses are an advanced feature. They are discussed here only because
this is the @value{SECTION} on @code{getline}.
@xref{Two-way I/O},
where coprocesses are discussed in more detail.

@node Getline/Variable/Coprocess
@subsection Using @code{getline} into a Variable from a Coprocess
@c comma before using is NOT for tertiary
@cindex variables, @code{getline} command into, using

When you use @samp{@var{command} |& getline @var{var}}, the output from
the coprocess @var{command} is sent through a two-way pipe to @code{getline}
and into the variable @var{var}.

In this version of @code{getline}, none of the built-in variables are
changed and the record is not split into fields. The only variable
changed is @var{var}.

@ifinfo
Coprocesses are an advanced feature. They are discussed here only because
this is the @value{SECTION} on @code{getline}.
@xref{Two-way I/O},
where coprocesses are discussed in more detail.
@end ifinfo

@node Getline Notes
@subsection Points to Remember About @code{getline}
Here are some miscellaneous points about @code{getline} that
you should bear in mind:

@itemize @bullet
@item
When @code{getline} changes the value of @code{$0} and @code{NF},
@command{awk} does @emph{not} automatically jump to the start of the
program and start testing the new record against every pattern.
However, the new record is tested against any subsequent rules.

@cindex differences in @command{awk} and @command{gawk}, implementation limitations
@cindex implementation issues, @command{gawk}, limits
@cindex @command{awk}, implementations, limits
@cindex @command{gawk}, implementation issues, limits
@item
Many @command{awk} implementations limit the number of pipelines that an @command{awk}
program may have open to just one. In @command{gawk}, there is no such limit.
You can open as many pipelines (and coprocesses) as the underlying operating
system permits.

@cindex side effects, @code{FILENAME} variable
@c The comma before "setting with" does NOT represent a tertiary
@cindex @code{FILENAME} variable, @code{getline}, setting with
@cindex dark corner, @code{FILENAME} variable
@cindex @code{getline} command, @code{FILENAME} variable and
@cindex @code{BEGIN} pattern, @code{getline} and
@item
An interesting side effect occurs if you use @code{getline} without a
redirection inside a @code{BEGIN} rule. Because an unredirected @code{getline}
reads from the command-line @value{DF}s, the first @code{getline} command
causes @command{awk} to set the value of @code{FILENAME}. Normally,
@code{FILENAME} does not have a value inside @code{BEGIN} rules, because you
have not yet started to process the command-line @value{DF}s.
@value{DARKCORNER}
(@xref{BEGIN/END},
also @pxref{Auto-set}.)

@item
Using @code{FILENAME} with @code{getline}
(@samp{getline < FILENAME})
is likely to be a source of
confusion. @command{awk} opens a separate input stream from the
current input file. However, by not using a variable, @code{$0}
and @code{NF} are still updated. If you're doing this, it's
probably by accident, and you should reconsider what it is you're
trying to accomplish.
@end itemize

@node Getline Summary
@subsection Summary of @code{getline} Variants
@cindex @code{getline} command, variants

The following table summarizes the eight variants of @code{getline},
listing which built-in variables are set by each one.

@multitable {@var{command} @code{|& getline} @var{var}} {1234567890123456789012345678901234567890}
@item @code{getline} @tab Sets @code{$0}, @code{NF}, @code{FNR}, and @code{NR}

@item @code{getline} @var{var} @tab Sets @var{var}, @code{FNR}, and @code{NR}

@item @code{getline <} @var{file} @tab Sets @code{$0} and @code{NF}

@item @code{getline @var{var} < @var{file}} @tab Sets @var{var}

@item @var{command} @code{| getline} @tab Sets @code{$0} and @code{NF}

@item @var{command} @code{| getline} @var{var} @tab Sets @var{var}

@item @var{command} @code{|& getline} @tab Sets @code{$0} and @code{NF}.
This is a @command{gawk} extension

@item @var{command} @code{|& getline} @var{var} @tab Sets @var{var}.
This is a @command{gawk} extension
@end multitable
@c ENDOFRANGE getl
@c ENDOFRANGE inex
@c ENDOFRANGE infir

@node Printing
@chapter Printing Output

@c STARTOFRANGE prnt
@cindex printing
@cindex output, printing, See printing
One of the most common programming actions is to @dfn{print}, or output,
some or all of the input. Use the @code{print} statement
for simple output, and the @code{printf} statement
for fancier formatting.
The @code{print} statement is not limited when
computing @emph{which} values to print. However, with two exceptions,
you cannot specify @emph{how} to print them---how many
columns, whether to use exponential notation or not, and so on.
(For the exceptions, @pxref{Output Separators}, and
@ref{OFMT}.)
For printing with specifications, you need the @code{printf} statement
(@pxref{Printf}).
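
As a brief illustration of the difference, consider the following
sketch. It assumes that @code{OFMT} still has its default value, and
the variable @code{x} is used only for illustration:

@example
$ awk 'BEGIN @{ x = 3.14159265; print x; printf "%.2f\n", x @}'
@print{} 3.14159
@print{} 3.14
@end example

@noindent
The @code{print} statement applies the default numeric format, whereas
@code{printf} prints exactly the two decimal places requested.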

@c STARTOFRANGE prnts
@cindex @code{print} statement
@cindex @code{printf} statement
Besides basic and formatted printing, this @value{CHAPTER}
also covers I/O redirections to files and pipes, introduces
the special @value{FN}s that @command{gawk} processes internally,
and discusses the @code{close} built-in function.

@menu
* Print::                       The @code{print} statement.
* Print Examples::              Simple examples of @code{print} statements.
* Output Separators::           The output separators and how to change them.
* OFMT::                        Controlling Numeric Output With @code{print}.
* Printf::                      The @code{printf} statement.
* Redirection::                 How to redirect output to multiple files and
                                pipes.
* Special Files::               File name interpretation in @command{gawk}.
                                @command{gawk} allows access to inherited file
                                descriptors.
* Close Files And Pipes::       Closing Input and Output Files and Pipes.
@end menu

@node Print
@section The @code{print} Statement

The @code{print} statement is used to produce output with simple, standardized
formatting. Specify only the strings or numbers to print, in a
list separated by commas. They are output, separated by single spaces,
followed by a newline. The statement looks like this:

@example
print @var{item1}, @var{item2}, @dots{}
@end example

@noindent
The entire list of items may be optionally enclosed in parentheses. The
parentheses are necessary if any of the item expressions uses the @samp{>}
relational operator; otherwise it could be confused with a redirection
(@pxref{Redirection}).

The items to print can be constant strings or numbers, fields of the
current record (such as @code{$1}), variables, or any @command{awk}
expression. Numeric values are converted to strings and then printed.
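
The following sketch illustrates why the parentheses matter when an
item uses @samp{>}; the values compared are arbitrary:

@example
$ awk 'BEGIN @{ print (2 > 1) @}'
@print{} 1
@end example

@noindent
Without the parentheses, @samp{print 2 > 1} would instead be treated as
a redirection, writing @samp{2} to a file named @file{1}.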

@cindex records, printing
@cindex lines, blank, printing
@cindex text, printing
The simple statement @samp{print} with no items is equivalent to
@samp{print $0}: it prints the entire current record. To print a blank
line, use @samp{print ""}, where @code{""} is the empty string.
To print a fixed piece of text, use a string constant, such as
@w{@code{"Don't Panic"}}, as one item. If you forget to use the
double-quote characters, your text is taken as an @command{awk}
expression, and you will probably get an error. Keep in mind that a
space is printed between any two items.

@node Print Examples
@section Examples of @code{print} Statements

Each @code{print} statement makes at least one line of output. However, it
isn't limited to only one line. If an item value is a string that contains a
newline, the newline is output along with the rest of the string. A
single @code{print} statement can make any number of lines this way.

@cindex newlines, printing
The following is an example of printing a string that contains embedded newlines
(the @samp{\n} is an escape sequence, used to represent the newline
character; @pxref{Escape Sequences}):

@example
$ awk 'BEGIN @{ print "line one\nline two\nline three" @}'
@print{} line one
@print{} line two
@print{} line three
@end example

@cindex fields, printing
The next example, which is run on the @file{inventory-shipped} file,
prints the first two fields of each input record, with a space between
them:

@example
$ awk '@{ print $1, $2 @}' inventory-shipped
@print{} Jan 13
@print{} Feb 15
@print{} Mar 15
@dots{}
@end example

@cindex @code{print} statement, commas, omitting
@c comma does NOT start tertiary
@cindex troubleshooting, @code{print} statement, omitting commas
A common mistake in using the @code{print} statement is to omit the comma
between two items. This often has the effect of making the items run
together in the output, with no space. The reason for this is that
juxtaposing two string expressions in @command{awk} means to concatenate
them. Here is the same program, without the comma:

@example
$ awk '@{ print $1 $2 @}' inventory-shipped
@print{} Jan13
@print{} Feb15
@print{} Mar15
@dots{}
@end example

@c comma does NOT start tertiary
@cindex @code{BEGIN} pattern, headings, adding
To someone unfamiliar with the @file{inventory-shipped} file, neither
example's output makes much sense. A heading line at the beginning
would make it clearer. Let's add some headings to our table of months
(@code{$1}) and green crates shipped (@code{$2}).
We do this using the
@code{BEGIN} pattern
(@pxref{BEGIN/END})
so that the headings are only printed once:

@example
awk 'BEGIN @{  print "Month Crates"
              print "----- ------" @}
           @{  print $1, $2 @}' inventory-shipped
@end example

@noindent
When run, the program prints the following:

@example
Month Crates
----- ------
Jan 13
Feb 15
Mar 15
@dots{}
@end example

@noindent
The only problem, however, is that the headings and the table data
don't line up! We can fix this by printing some spaces between the
two fields:

@example
@group
awk 'BEGIN @{ print "Month Crates"
             print "----- ------" @}
           @{ print $1, " ", $2 @}' inventory-shipped
@end group
@end example

@c comma does NOT start tertiary
@cindex @code{printf} statement, columns, aligning
@cindex columns, aligning
Lining up columns this way can get pretty
complicated when there are many columns to fix. Counting spaces for two
or three columns is simple, but any more than this can take up
a lot of time. This is why the @code{printf} statement was
created (@pxref{Printf});
one of its specialties is lining up columns of data.

@cindex line continuations, in @code{print} statement
@cindex @code{print} statement, line continuations and
@strong{Note:} You can continue either a @code{print} or
@code{printf} statement simply by putting a newline after any comma
(@pxref{Statements/Lines}).
@c ENDOFRANGE prnts

@node Output Separators
@section Output Separators

@cindex @code{OFS} variable
As mentioned previously, a @code{print} statement contains a list
of items separated by commas. In the output, the items are normally
separated by single spaces. However, this doesn't need to be the case;
a single space is only the default.
Any string of
characters may be used as the @dfn{output field separator} by setting the
built-in variable @code{OFS}. The initial value of this variable
is the string @w{@code{" "}}---that is, a single space.

The output from an entire @code{print} statement is called an
@dfn{output record}. Each @code{print} statement outputs one output
record, and then outputs a string called the @dfn{output record separator}
(or @code{ORS}). The initial
value of @code{ORS} is the string @code{"\n"}; i.e., a newline
character. Thus, each @code{print} statement normally makes a separate line.

@cindex output, records
@cindex output record separator, See @code{ORS} variable
@cindex @code{ORS} variable
@cindex @code{BEGIN} pattern, @code{OFS}/@code{ORS} variables, assigning values to
In order to change how output fields and records are separated, assign
new values to the variables @code{OFS} and @code{ORS}. The usual
place to do this is in the @code{BEGIN} rule
(@pxref{BEGIN/END}), so
that it happens before any input is processed. It can also be done
with assignments on the command line, before the names of the input
files, or using the @option{-v} command-line option
(@pxref{Options}).
The following example prints the first and second fields of each input
record, separated by a semicolon, with a blank line added after each
newline:

@ignore
Exercise,
Rewrite the
@example
awk 'BEGIN @{ print "Month Crates"
             print "----- ------" @}
           @{ print $1, " ", $2 @}' inventory-shipped
@end example
program by using a new value of @code{OFS}.
@end ignore

@example
$ awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @}
>            @{ print $1, $2 @}' BBS-list
@print{} aardvark;555-5553
@print{}
@print{} alpo-net;555-3412
@print{}
@print{} barfly;555-7685
@dots{}
@end example

If the value of @code{ORS} does not contain a newline, the program's output
is run together on a single line.

@node OFMT
@section Controlling Numeric Output with @code{print}
@cindex numeric, output format
@c the comma does NOT start a secondary
@cindex formats, numeric output
When the @code{print} statement is used to print numeric values,
@command{awk} internally converts the number to a string of characters
and prints that string. @command{awk} uses the @code{sprintf} function
to do this conversion
(@pxref{String Functions}).
For now, it suffices to say that the @code{sprintf}
function accepts a @dfn{format specification} that tells it how to format
numbers (or strings), and that there are a number of different ways in which
numbers can be formatted. The different format specifications are discussed
more fully in
@ref{Control Letters}.

@cindex @code{sprintf} function
@cindex @code{OFMT} variable
@c the comma before OFMT does NOT start a tertiary
@cindex output, format specifier, @code{OFMT}
The built-in variable @code{OFMT} contains the default format specification
that @code{print} uses with @code{sprintf} when it wants to convert a
number to a string for printing.
The default value of @code{OFMT} is @code{"%.6g"}.
The way @code{print} prints numbers can be changed
by supplying different format specifications
as the value of @code{OFMT}, as shown in the following example:

@example
$ awk 'BEGIN @{
>   OFMT = "%.0f"  # print numbers as integers (rounds)
>   print 17.23, 17.54 @}'
@print{} 17 18
@end example

@noindent
@cindex dark corner, @code{OFMT} variable
@cindex POSIX @command{awk}, @code{OFMT} variable and
@cindex @code{OFMT} variable, POSIX @command{awk} and
According to the POSIX standard, @command{awk}'s behavior is undefined
if @code{OFMT} contains anything but a floating-point conversion specification.
@value{DARKCORNER}

@node Printf
@section Using @code{printf} Statements for Fancier Printing

@c STARTOFRANGE printfs
@cindex @code{printf} statement
@cindex output, formatted
@cindex formatting output
For more precise control over the output format than what is
normally provided by @code{print}, use @code{printf}.
@code{printf} can be used to
specify the width to use for each item, as well as various
formatting choices for numbers (such as what output base to use, whether to
print an exponent, whether to print a sign, and how many digits to print
after the decimal point). This is done by supplying a string, called
the @dfn{format string}, that controls how and where to print the other
arguments.

@menu
* Basic Printf::                Syntax of the @code{printf} statement.
* Control Letters::             Format-control letters.
* Format Modifiers::            Format-specification modifiers.
* Printf Examples::             Several examples.
@end menu

@node Basic Printf
@subsection Introduction to the @code{printf} Statement

@cindex @code{printf} statement, syntax of
A simple @code{printf} statement looks like this:

@example
printf @var{format}, @var{item1}, @var{item2}, @dots{}
@end example

@noindent
The entire list of arguments may optionally be enclosed in parentheses. The
parentheses are necessary if any of the item expressions use the @samp{>}
relational operator; otherwise, it can be confused with a redirection
(@pxref{Redirection}).

@cindex format strings
The difference between @code{printf} and @code{print} is the @var{format}
argument. This is an expression whose value is taken as a string; it
specifies how to output each of the other arguments. It is called the
@dfn{format string}.

The format string is very similar to that in the ISO C library function
@code{printf}. Most of @var{format} is text to output verbatim.
Scattered among this text are @dfn{format specifiers}---one per item.
Each format specifier says to output the next item in the argument list
at that place in the format.

The @code{printf} statement does not automatically append a newline
to its output. It outputs only what the format string specifies.
So if a newline is needed, you must include one in the format string.
The output separator variables @code{OFS} and @code{ORS} have no effect
on @code{printf} statements. For example:

@example
$ awk 'BEGIN @{
>    ORS = "\nOUCH!\n"; OFS = "+"
>    msg = "Dont Panic!"
>    printf "%s\n", msg
> @}'
@print{} Dont Panic!
@end example

@noindent
Here, neither the @samp{+} nor the @samp{OUCH} appears when
the message is printed.
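
For comparison, here is a sketch of the same settings used with
@code{print}, which does honor both variables. The two string items
are chosen only for illustration:

@example
$ awk 'BEGIN @{
>    ORS = "\nOUCH!\n"; OFS = "+"
>    print "Dont", "Panic!"
> @}'
@print{} Dont+Panic!
@print{} OUCH!
@end example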

@node Control Letters
@subsection Format-Control Letters
@cindex @code{printf} statement, format-control characters
@cindex format specifiers, @code{printf} statement

A format specifier starts with the character @samp{%} and ends with
a @dfn{format-control letter}---it tells the @code{printf} statement
how to output one item. The format-control letter specifies what @emph{kind}
of value to print. The rest of the format specifier is made up of
optional @dfn{modifiers} that control @emph{how} to print the value, such as
the field width. Here is a list of the format-control letters:

@table @code
@item %c
This prints a number as an ASCII character; thus, @samp{printf "%c",
65} outputs the letter @samp{A}. (The output for a string value is
the first character of the string.)

@item %d@r{,} %i
These are equivalent; they both print a decimal integer.
(The @samp{%i} specification is for compatibility with ISO C.)

@item %e@r{,} %E
These print a number in scientific (exponential) notation;
for example:

@example
printf "%4.3e\n", 1950
@end example

@noindent
prints @samp{1.950e+03}, with a total of four significant figures, three of
which follow the decimal point.
(The @samp{4.3} represents two modifiers,
discussed in the next @value{SUBSECTION}.)
@samp{%E} uses @samp{E} instead of @samp{e} in the output.

@item %f
This prints a number in floating-point notation.
For example:

@example
printf "%4.3f", 1950
@end example

@noindent
prints @samp{1950.000}, with three digits following the decimal point;
the width of four is only a minimum, so the value is not truncated.
(The @samp{4.3} represents two modifiers,
discussed in the next @value{SUBSECTION}.)

@item %g@r{,} %G
These print a number in either scientific notation or in floating-point
notation, whichever uses fewer characters; if the result is printed in
scientific notation, @samp{%G} uses @samp{E} instead of @samp{e}.

@item %o
This prints an unsigned octal integer.

@item %s
This prints a string.

@item %u
This prints an unsigned decimal integer.
(This format is of marginal use, because all numbers in @command{awk}
are floating-point; it is provided primarily for compatibility with C.)

@item %x@r{,} %X
These print an unsigned hexadecimal integer;
@samp{%X} uses the letters @samp{A} through @samp{F}
instead of @samp{a} through @samp{f}.

@item %%
This isn't a format-control letter, but it does have meaning---the
sequence @samp{%%} outputs one @samp{%}; it does not consume an
argument and it ignores any modifiers.
@end table

@cindex dark corner, format-control characters
@cindex @command{gawk}, format-control characters
@strong{Note:}
When using the integer format-control letters for values that are
outside the range of the widest C integer type, @command{gawk} switches to
the @samp{%g} format specifier. If @option{--lint} is provided on the
command line (@pxref{Options}), @command{gawk}
warns about this. Other versions of @command{awk} may print invalid
values or do something else entirely.
@value{DARKCORNER}

@node Format Modifiers
@subsection Modifiers for @code{printf} Formats

@c STARTOFRANGE pfm
@cindex @code{printf} statement, modifiers
@c the comma here does NOT start a secondary
@cindex modifiers, in format specifiers
A format specification can also include @dfn{modifiers} that can control
how much of the item's value is printed, as well as how much space it gets.
The modifiers come between the @samp{%} and the format-control letter.
We will use the bullet symbol ``@bullet{}'' in the following examples to
represent
spaces in the output. Here are the possible modifiers, in the order in
which they may appear:

@table @code
@cindex differences in @command{awk} and @command{gawk}, @code{print}/@code{printf} statements
@cindex @code{printf} statement, positional specifiers
@c the command does NOT start a secondary
@cindex positional specifiers, @code{printf} statement
@item @var{N}$
An integer constant followed by a @samp{$} is a @dfn{positional specifier}.
Normally, format specifications are applied to arguments in the order
given in the format string. With a positional specifier, the format
specification is applied to a specific argument, instead of what
would be the next argument in the list. Positional specifiers begin
counting with one. Thus:

@example
printf "%s %s\n", "don't", "panic"
printf "%2$s %1$s\n", "panic", "don't"
@end example

@noindent
prints the famous friendly message twice.

At first glance, this feature doesn't seem to be of much use.
It is in fact a @command{gawk} extension, intended for use in translating
messages at runtime.
@xref{Printf Ordering},
which describes how and why to use positional specifiers.
For now, we will not use them.

@item -
The minus sign, used before the width modifier (see later on in
this table),
says to left-justify
the argument within its specified width. Normally, the argument
is printed right-justified in the specified width. Thus:

@example
printf "%-4s", "foo"
@end example

@noindent
prints @samp{foo@bullet{}}.

@item @var{space}
For numeric conversions, prefix positive values with a space and
negative values with a minus sign.

@item +
The plus sign, used before the width modifier (see later on in
this table),
says to always supply a sign for numeric conversions, even if the data
to format is positive. The @samp{+} overrides the space modifier.

@item #
Use an ``alternate form'' for certain control letters.
For @samp{%o}, supply a leading zero.
For @samp{%x} and @samp{%X}, supply a leading @samp{0x} or @samp{0X} for
a nonzero result.
For @samp{%e}, @samp{%E}, and @samp{%f}, the result always contains a
decimal point.
For @samp{%g} and @samp{%G}, trailing zeros are not removed from the result.

@cindex dark corner
@item 0
A leading @samp{0} (zero) acts as a flag that indicates that output should be
padded with zeros instead of spaces.
This applies even to non-numeric output formats.
@value{DARKCORNER}
This flag only has an effect when the field width is wider than the
value to print.

@item @var{width}
This is a number specifying the desired minimum width of a field. Inserting any
number between the @samp{%} sign and the format-control character forces the
field to expand to this width. The default way to do this is to
pad with spaces on the left. For example:

@example
printf "%4s", "foo"
@end example

@noindent
prints @samp{@bullet{}foo}.

The value of @var{width} is a minimum width, not a maximum. If the item
value requires more than @var{width} characters, it can be as wide as
necessary. Thus, the following:

@example
printf "%4s", "foobar"
@end example

@noindent
prints @samp{foobar}.

Preceding the @var{width} with a minus sign causes the output to be
padded with spaces on the right, instead of on the left.

@item .@var{prec}
A period followed by an integer constant
specifies the precision to use when printing.
The meaning of the precision varies by control letter:

@table @asis
@item @code{%e}, @code{%E}, @code{%f}
Number of digits to the right of the decimal point.

@item @code{%g}, @code{%G}
Maximum number of significant digits.

@item @code{%d}, @code{%i}, @code{%o}, @code{%u}, @code{%x}, @code{%X}
Minimum number of digits to print.

@item @code{%s}
Maximum number of characters from the string that should print.
@end table

Thus, the following:

@example
printf "%.4s", "foobar"
@end example

@noindent
prints @samp{foob}.
@end table

The C library @code{printf}'s dynamic @var{width} and @var{prec}
capability (for example, @code{"%*.*s"}) is supported. Instead of
supplying explicit @var{width} and/or @var{prec} values in the format
string, they are passed in the argument list. For example:

@example
w = 5
p = 3
s = "abcdefg"
printf "%*.*s\n", w, p, s
@end example

@noindent
is exactly equivalent to:

@example
s = "abcdefg"
printf "%5.3s\n", s
@end example

@noindent
Both programs output @samp{@w{@bullet{}@bullet{}abc}}.
Earlier versions of @command{awk} did not support this capability.
If you must use such a version, you may simulate this feature by using
concatenation to build up the format string, like so:

@example
w = 5
p = 3
s = "abcdefg"
printf "%" w "." p "s\n", s
@end example

@noindent
This is not particularly easy to read but it does work.

@c @cindex lint checks
@cindex troubleshooting, fatal errors, @code{printf} format strings
@cindex POSIX @command{awk}, @code{printf} format strings and
C programmers may be used to supplying additional
@samp{l}, @samp{L}, and @samp{h}
modifiers in @code{printf} format strings. These are not valid in @command{awk}.
Most @command{awk} implementations silently ignore these modifiers.
If @option{--lint} is provided on the command line
(@pxref{Options}),
@command{gawk} warns about their use. If @option{--posix} is supplied,
their use is a fatal error.
@c ENDOFRANGE pfm

@node Printf Examples
@subsection Examples Using @code{printf}

The following is a simple example of
how to use @code{printf} to make an aligned table:

@example
awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@end example

@noindent
This command
prints the names of the bulletin boards (@code{$1}) in the file
@file{BBS-list} as a string of 10 characters that are left-justified. It also
prints the phone numbers (@code{$2}) next on the line. This
produces an aligned two-column table of names and phone numbers,
as shown here:

@example
$ awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@print{} aardvark   555-5553
@print{} alpo-net   555-3412
@print{} barfly     555-7685
@print{} bites      555-1675
@print{} camelot    555-0542
@print{} core       555-2912
@print{} fooey      555-1234
@print{} foot       555-6699
@print{} macfoo     555-6480
@print{} sdace      555-3430
@print{} sabafoo    555-2127
@end example

In this case, the phone numbers had to be printed as strings because
the numbers are separated by a dash. Printing the phone numbers as
numbers would have produced just the first three digits: @samp{555}.
This would have been pretty confusing.

It wasn't necessary to specify a width for the phone numbers because
they are last on their lines. They don't need to have spaces
after them.

The table could be made to look even nicer by adding headings to the
tops of the columns.
This is done using the @code{BEGIN} pattern
(@pxref{BEGIN/END})
so that the headers are only printed once, at the beginning of
the @command{awk} program:

@example
awk 'BEGIN @{ print "Name       Number"
             print "----       ------" @}
     @{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@end example

The above example mixed @code{print} and @code{printf} statements in
the same program. Using just @code{printf} statements can produce the
same results:

@example
awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number"
             printf "%-10s %s\n", "----", "------" @}
     @{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@end example

@noindent
Printing each column heading with the same format specification
used for the column elements ensures that the headings
are aligned just like the columns.

The fact that the same format specification is used three times can be
emphasized by storing it in a variable, like this:

@example
awk 'BEGIN @{ format = "%-10s %s\n"
             printf format, "Name", "Number"
             printf format, "----", "------" @}
     @{ printf format, $1, $2 @}' BBS-list
@end example

@c !!! exercise
At this point, it would be a worthwhile exercise to use the
@code{printf} statement to line up the headings and table data for the
@file{inventory-shipped} example that was covered earlier in the @value{SECTION}
on the @code{print} statement
(@pxref{Print}).
@c ENDOFRANGE printfs

@node Redirection
@section Redirecting Output of @code{print} and @code{printf}

@cindex output redirection
@cindex redirection of output
So far, the output from @code{print} and @code{printf} has gone
to the standard
output, usually the terminal. Both @code{print} and @code{printf} can
also send their output to other places.
This is called @dfn{redirection}.

A redirection appears after the @code{print} or @code{printf} statement.
Redirections in @command{awk} are written just like redirections in shell
commands, except that they are written inside the @command{awk} program.

@c the commas here are part of the see also
@cindex @code{print} statement, See Also redirection, of output
@cindex @code{printf} statement, See Also redirection, of output
There are four forms of output redirection: output to a file, output
appended to a file, output through a pipe to another command, and output
to a coprocess. They are all shown for the @code{print} statement,
but they work identically for @code{printf}:

@table @code
@cindex @code{>} (right angle bracket), @code{>} operator (I/O)
@cindex right angle bracket (@code{>}), @code{>} operator (I/O)
@cindex operators, input/output
@item print @var{items} > @var{output-file}
This type of redirection prints the items into the output file named
@var{output-file}. The @value{FN} @var{output-file} can be any
expression. Its value is changed to a string and then used as a
@value{FN} (@pxref{Expressions}).

When this type of redirection is used, the @var{output-file} is erased
before the first output is written to it. Subsequent writes to the same
@var{output-file} do not erase @var{output-file}, but append to it.
(This is different from how you use redirections in shell scripts.)
If @var{output-file} does not exist, it is created.
For example, here
is how an @command{awk} program can write a list of BBS names to one
file named @file{name-list}, and a list of phone numbers to another file
named @file{phone-list}:

@example
$ awk '@{ print $2 > "phone-list"
>        print $1 > "name-list" @}' BBS-list
$ cat phone-list
@print{} 555-5553
@print{} 555-3412
@dots{}
$ cat name-list
@print{} aardvark
@print{} alpo-net
@dots{}
@end example

@noindent
Each output file contains one name or number per line.

@cindex @code{>} (right angle bracket), @code{>>} operator (I/O)
@cindex right angle bracket (@code{>}), @code{>>} operator (I/O)
@item print @var{items} >> @var{output-file}
This type of redirection prints the items into the output file
named @var{output-file}. The difference between this and the
single-@samp{>} redirection is that the old contents (if any) of
@var{output-file} are not erased. Instead, the @command{awk} output is
appended to the file.
If @var{output-file} does not exist, then it is created.

@cindex @code{|} (vertical bar), @code{|} operator (I/O)
@cindex pipes, output
@cindex output, pipes
@item print @var{items} | @var{command}
It is also possible to send output to another program through a pipe
instead of into a file. This type of redirection opens a pipe to
@var{command}, and writes the values of @var{items} through this pipe
to another process created to execute @var{command}.

The redirection argument @var{command} is actually an @command{awk}
expression. Its value is converted to a string whose contents give
the shell command to be run. For example, the following produces two
files, one unsorted list of BBS names, and one list sorted in reverse
alphabetical order:

@ignore
10/2000:
This isn't the best style, since COMMAND is assigned for each
record.
It's done to avoid overfull hboxes in TeX. Leave it
alone for now and let's hope no-one notices.
@end ignore

@example
awk '@{ print $1 > "names.unsorted"
       command = "sort -r > names.sorted"
       print $1 | command @}' BBS-list
@end example

The unsorted list is written with an ordinary redirection, while
the sorted list is written by piping through the @command{sort} utility.

The next example uses redirection to mail a message to the mailing
list @samp{bug-system}. This might be useful when trouble is encountered
in an @command{awk} script run periodically for system maintenance:

@example
report = "mail bug-system"
print "Awk script failed:", $0 | report
m = ("at record number " FNR " of " FILENAME)
print m | report
close(report)
@end example

The message is built using string concatenation and saved in the variable
@code{m}. It's then sent down the pipeline to the @command{mail} program.
(The parentheses group the items to concatenate---see
@ref{Concatenation}.)

The @code{close} function is called here because it's a good idea to close
the pipe as soon as all the intended output has been sent to it.
@xref{Close Files And Pipes},
for more information.

This example also illustrates the use of a variable to represent
a @var{file} or @var{command}---it is not necessary to always
use a string constant. Using a variable is generally a good idea,
because @command{awk} requires that the string value be spelled identically
every time.

@cindex coprocesses
@cindex @code{|} (vertical bar), @code{|&} operator (I/O)
@cindex operators, input/output
@cindex differences in @command{awk} and @command{gawk}, input/output operators
@item print @var{items} |& @var{command}
This type of redirection prints the items to the input of @var{command}.
The difference between this and the
single-@samp{|} redirection is that the output from @var{command}
can be read with @code{getline}.
Thus @var{command} is a @dfn{coprocess}, which works together with,
but is subsidiary to, the @command{awk} program.

This feature is a @command{gawk} extension, and is not available in
POSIX @command{awk}.
@xref{Two-way I/O},
for a more complete discussion.
@end table

Redirecting output using @samp{>}, @samp{>>}, @samp{|}, or @samp{|&}
asks the system to open a file, pipe, or coprocess only if the particular
@var{file} or @var{command} you specify has not already been written
to by your program or if it has been closed since it was last written to.

@cindex troubleshooting, printing
It is a common error to use @samp{>} redirection for the first @code{print}
to a file, and then to use @samp{>>} for subsequent output:

@example
# clear the file
print "Don't panic" > "guide.txt"
@dots{}
# append
print "Avoid improbability generators" >> "guide.txt"
@end example

@noindent
This is indeed how redirections must be used from the shell. But in
@command{awk}, it isn't necessary. In this kind of case, a program should
use @samp{>} for all the @code{print} statements, since the output file
is only opened once.

@cindex differences in @command{awk} and @command{gawk}, implementation limitations
@c the comma here does NOT start a secondary
@cindex implementation issues, @command{gawk}, limits
@cindex @command{awk}, implementation issues, pipes
@cindex @command{gawk}, implementation issues, pipes
@ifnotinfo
As mentioned earlier
(@pxref{Getline Notes}),
many
@end ifnotinfo
@ifnottex
Many
@end ifnottex
@command{awk} implementations limit the number of pipelines that an @command{awk}
program may have open to just one!
In @command{gawk}, there is no such limit.
@command{gawk} allows a program to
open as many pipelines as the underlying operating system permits.

@c fakenode --- for prepinfo
@subheading Advanced Notes: Piping into @command{sh}
@cindex advanced features, piping into @command{sh}
@cindex shells, piping commands into

A particularly powerful way to use redirection is to build command lines
and pipe them into the shell, @command{sh}. For example, suppose you
have a list of files brought over from a system where all the @value{FN}s
are stored in uppercase, and you wish to rename them to have names in
all lowercase. The following program is both simple and efficient:

@c @cindex @command{mv} utility
@example
@{ printf("mv %s %s\n", $0, tolower($0)) | "sh" @}

END @{ close("sh") @}
@end example

The @code{tolower} function returns its argument string with all
uppercase characters converted to lowercase
(@pxref{String Functions}).
The program builds up a list of command lines,
using the @command{mv} utility to rename the files.
It then sends the list to the shell for execution.
@c ENDOFRANGE outre
@c ENDOFRANGE reout

@node Special Files
@section Special @value{FFN}s in @command{gawk}
@c STARTOFRANGE gfn
@cindex @command{gawk}, @value{FN}s in

@command{gawk} provides a number of special @value{FN}s that it interprets
internally. These @value{FN}s provide access to standard file descriptors,
process-related information, and TCP/IP networking.

@menu
* Special FD::                  Special files for I/O.
* Special Process::             Special files for process information.
* Special Network::             Special files for network communications.
* Special Caveats::             Things to watch out for.
@end menu

@node Special FD
@subsection Special Files for Standard Descriptors
@cindex standard input
@cindex input, standard
@cindex standard output
@cindex output, standard
@cindex error output
@cindex file descriptors
@cindex files, descriptors, See file descriptors

Running programs conventionally have three input and output streams
already available to them for reading and writing. These are known as
the @dfn{standard input}, @dfn{standard output}, and @dfn{standard error
output}. These streams are, by default, connected to your terminal, but
they are often redirected with the shell, via the @samp{<}, @samp{<<},
@samp{>}, @samp{>>}, @samp{>&}, and @samp{|} operators. Standard error
is typically used for writing error messages; the reason there are two separate
streams, standard output and standard error, is so that they can be
redirected separately.

@cindex differences in @command{awk} and @command{gawk}, error messages
@cindex error handling
In other implementations of @command{awk}, the only way to write an error
message to standard error in an @command{awk} program is as follows:

@example
print "Serious error detected!" | "cat 1>&2"
@end example

@noindent
This works by opening a pipeline to a shell command that can access the
standard error stream that it inherits from the @command{awk} process.
This is far from elegant, and it is also inefficient, because it requires a
separate process. So people writing @command{awk} programs often
don't do this. Instead, they send the error messages to the
terminal, like this:

@example
print "Serious error detected!" > "/dev/tty"
@end example

@noindent
This usually has the same effect but not always: although the
standard error stream is usually the terminal, it can be redirected; when
that happens, writing to the terminal is not correct. In fact, if
@command{awk} is run from a background job, it may not have a terminal at all.
Then opening @file{/dev/tty} fails.

@command{gawk} provides special @value{FN}s for accessing the three standard
streams, as well as any other inherited open files. If the @value{FN} matches
one of these special names when @command{gawk} redirects input or output,
then it directly uses the stream that the @value{FN} stands for.
These special @value{FN}s work for all operating systems that @command{gawk}
has been ported to, not just those that are POSIX-compliant:

@cindex @value{FN}s, standard streams in @command{gawk}
@cindex @code{/dev/@dots{}} special files (@command{gawk})
@cindex files, @code{/dev/@dots{}} special files
@c @cindex @code{/dev/stdin} special file
@c @cindex @code{/dev/stdout} special file
@c @cindex @code{/dev/stderr} special file
@c @cindex @code{/dev/fd} special files
@table @file
@item /dev/stdin
The standard input (file descriptor 0).

@item /dev/stdout
The standard output (file descriptor 1).

@item /dev/stderr
The standard error output (file descriptor 2).

@item /dev/fd/@var{N}
The file associated with file descriptor @var{N}. Such a file must
be opened by the program initiating the @command{awk} execution (typically
the shell). Unless special pains are taken in the shell from which
@command{gawk} is invoked, only descriptors 0, 1, and 2 are available.
@end table

The @value{FN}s @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr}
are aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and @file{/dev/fd/2},
respectively.
However, they are more self-explanatory.
The proper way to write an error message in a @command{gawk} program
is to use @file{/dev/stderr}, like this:

@example
print "Serious error detected!" > "/dev/stderr"
@end example

@cindex troubleshooting, quotes with @value{FN}s
Note the use of quotes around the @value{FN}.
Like any other redirection, the value must be a string.
It is a common error to omit the quotes, which leads
to confusing results.
@c Exercise: What does it do? :-)

@node Special Process
@subsection Special Files for Process-Related Information

@cindex files, for process information
@cindex process information, files for
@command{gawk} also provides special @value{FN}s that give access to information
about the running @command{gawk} process. Each of these ``files'' provides
a single record of information. To read them more than once, they must
first be closed with the @code{close} function
(@pxref{Close Files And Pipes}).
The @value{FN}s are:

@c @cindex @code{/dev/pid} special file
@c @cindex @code{/dev/pgrpid} special file
@c @cindex @code{/dev/ppid} special file
@c @cindex @code{/dev/user} special file
@table @file
@item /dev/pid
Reading this file returns the process ID of the current process,
in decimal form, terminated with a newline.

@item /dev/ppid
Reading this file returns the parent process ID of the current process,
in decimal form, terminated with a newline.

@item /dev/pgrpid
Reading this file returns the process group ID of the current process,
in decimal form, terminated with a newline.

@item /dev/user
Reading this file returns a single record terminated with a newline.
The fields are separated with spaces.
The fields represent the
following information:

@table @code
@item $1
The return value of the @code{getuid} system call
(the real user ID number).

@item $2
The return value of the @code{geteuid} system call
(the effective user ID number).

@item $3
The return value of the @code{getgid} system call
(the real group ID number).

@item $4
The return value of the @code{getegid} system call
(the effective group ID number).
@end table

If there are any additional fields, they are the group IDs returned by
the @code{getgroups} system call.
(Multiple groups may not be supported on all systems.)
@end table

These special @value{FN}s may be used on the command line as @value{DF}s,
as well as for I/O redirections within an @command{awk} program.
They may not be used as source files with the @option{-f} option.

@c @cindex automatic warnings
@c @cindex warnings, automatic
@strong{Note:}
The special files that provide process-related information are now considered
obsolete and will disappear entirely
in the next release of @command{gawk}.
@command{gawk} prints a warning message every time you use one of
these files.
To obtain process-related information, use the @code{PROCINFO} array.
@xref{Auto-set}.

@node Special Network
@subsection Special Files for Network Communications
@cindex networks, support for
@cindex TCP/IP, support for

Starting with @value{PVERSION} 3.1 of @command{gawk}, @command{awk} programs
can open a two-way
TCP/IP connection, acting as either a client or a server.
This is done using a special @value{FN} of the form:

@example
@file{/inet/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}}
@end example

The @var{protocol} is one of @samp{tcp}, @samp{udp}, or @samp{raw},
and the other fields represent the other essential pieces of information
for making a networking connection.
These @value{FN}s are used with the @samp{|&} operator for communicating
with a coprocess
(@pxref{Two-way I/O}).
This is an advanced feature, mentioned here only for completeness.
Full discussion is delayed until
@ref{TCP/IP Networking}.

@node Special Caveats
@subsection Special @value{FFN} Caveats

Here is a list of things to bear in mind when using the
special @value{FN}s that @command{gawk} provides:

@itemize @bullet
@cindex compatibility mode (@command{gawk}), @value{FN}s
@cindex @value{FN}s, in compatibility mode
@item
Recognition of these special @value{FN}s is disabled if @command{gawk} is in
compatibility mode (@pxref{Options}).

@c @cindex automatic warnings
@c @cindex warnings, automatic
@cindex @code{PROCINFO} array
@item
@ifnottex
The
@end ifnottex
@ifnotinfo
As mentioned earlier, the
@end ifnotinfo
special files that provide process-related information are now considered
obsolete and will disappear entirely
in the next release of @command{gawk}.
@command{gawk} prints a warning message every time you use one of
these files.
@ifnottex
To obtain process-related information, use the @code{PROCINFO} array.
@xref{Built-in Variables}.
@end ifnottex

@item
Starting with @value{PVERSION} 3.1, @command{gawk} @emph{always}
interprets these special @value{FN}s.@footnote{Older versions of
@command{gawk} would interpret these names internally only if the system
did not actually have a @file{/dev/fd} directory or any of the other
special files listed earlier. Usually this didn't make a difference,
but sometimes it did; thus, it was decided to make @command{gawk}'s
behavior consistent on all systems and to have it always interpret
the special @value{FN}s itself.}
For example, using @samp{/dev/fd/4}
for output actually writes on file descriptor 4, and not on a new
file descriptor that is @code{dup}'ed from file descriptor 4. Most of
the time this does not matter; however, it is important to @emph{not}
close any of the files related to file descriptors 0, 1, and 2.
Doing so results in unpredictable behavior.
@end itemize
@c ENDOFRANGE gfn

@node Close Files And Pipes
@section Closing Input and Output Redirections
@cindex files, output, See output files
@c STARTOFRANGE ifc
@cindex input files, closing
@c comma before closing is NOT start of tertiary
@c STARTOFRANGE ofc
@cindex output, files, closing
@c STARTOFRANGE pc
@cindex pipes, closing
@c STARTOFRANGE cc
@cindex coprocesses, closing
@c comma before using is NOT start of tertiary
@cindex @code{getline} command, coprocesses, using from

If the same @value{FN} or the same shell command is used with @code{getline}
more than once during the execution of an @command{awk} program
(@pxref{Getline}),
the file is opened (or the command is executed) the first time only.
At that time, the first record of input is read from that file or command.
The next time the same file or command is used with @code{getline},
another record is read from it, and so on.
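
The behavior just described can be observed with a small experiment.
The following is only a minimal sketch: the shell function name
@code{first_two} and the temporary data file are invented for illustration,
and the commands assume a POSIX shell that provides @command{mktemp}.

```shell
# first_two: hypothetical helper showing that two getline calls on the
# same file name return successive records.
first_two() {
    tmp=$(mktemp) || return 1
    printf 'alpha\nbeta\ngamma\n' > "$tmp"
    awk -v f="$tmp" 'BEGIN {
        # The first getline opens the file and reads record 1.
        getline a < f
        # The second getline continues where the first one stopped.
        getline b < f
        print a, b
    }'
    rm -f "$tmp"
}

first_two
```

The file is opened only once, so the second @code{getline} reads the
second record rather than starting over.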

Similarly, when a file or pipe is opened for output, the @value{FN} or
command associated with it is remembered by @command{awk}, and subsequent
writes to the same file or command are appended to the previous writes.
The file or pipe stays open until @command{awk} exits.

@cindex @code{close} function
This implies that special steps are necessary in order to read the same
file again from the beginning, or to rerun a shell command (rather than
reading more output from the same command). The @code{close} function
makes these things possible:

@example
close(@var{filename})
@end example

@noindent
or:

@example
close(@var{command})
@end example

The argument @var{filename} or @var{command} can be any expression. Its
value must @emph{exactly} match the string that was used to open the file or
start the command (spaces and other ``irrelevant'' characters
included). For example, if you open a pipe with this:

@example
"sort -r names" | getline foo
@end example

@noindent
then you must close it with this:

@example
close("sort -r names")
@end example

Once this function call is executed, the next @code{getline} from that
file or command, or the next @code{print} or @code{printf} to that
file or command, reopens the file or reruns the command.
Because the expression that you use to close a file or pipeline must
exactly match the expression used to open the file or run the command,
it is good practice to use a variable to store the @value{FN} or command.
The previous example becomes the following:

@example
sortcom = "sort -r names"
sortcom | getline foo
@dots{}
close(sortcom)
@end example

@noindent
This helps avoid hard-to-find typographical errors in your @command{awk}
programs.
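
The effect of @code{close} on a later @code{getline} can be checked with
a short experiment. This is a sketch under stated assumptions, not a
definitive recipe: the function name @code{reread_first} and the temporary
file are hypothetical, and a POSIX shell with @command{mktemp} is assumed.

```shell
# reread_first: hypothetical helper showing that close() makes the next
# getline start again at the first record.
reread_first() {
    tmp=$(mktemp) || return 1
    printf 'alpha\nbeta\n' > "$tmp"
    awk -v f="$tmp" 'BEGIN {
        getline a < f     # opens the file and reads "alpha"
        close(f)          # same string value that was used to open it
        getline b < f     # reopens the file and reads "alpha" again
        print a, b
    }'
    rm -f "$tmp"
}

reread_first
```

Because the file name is held in the variable @code{f}, the value passed to
@code{close} cannot differ from the one used to open the file.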
Here are some of the reasons for closing an output file:

@itemize @bullet
@item
To write a file and read it back later on in the same @command{awk}
program. Close the file after writing it, then
begin reading it with @code{getline}.

@item
To write numerous files, successively, in the same @command{awk}
program. If the files aren't closed, eventually @command{awk} may exceed a
system limit on the number of open files in one process. It is best to
close each one when the program has finished writing it.

@item
To make a command finish. When output is redirected through a pipe,
the command reading the pipe normally continues to try to read input
as long as the pipe is open. Often this means the command cannot
really do its work until the pipe is closed. For example, if
output is redirected to the @command{mail} program, the message is not
actually sent until the pipe is closed.

@item
To run the same program a second time, with the same arguments.
This is not the same thing as giving more input to the first run!

For example, suppose a program pipes output to the @command{mail} program.
If it outputs several lines redirected to this pipe without closing
it, they make a single message of several lines. By contrast, if the
program closes the pipe after each line of output, then each line makes
a separate message.
@end itemize

@cindex differences in @command{awk} and @command{gawk}, @code{close} function
@cindex portability, @code{close} function and
If you use more files than the system allows you to have open,
@command{gawk} attempts to multiplex the available open files among
your @value{DF}s. @command{gawk}'s ability to do this depends upon the
facilities of your operating system, so it may not always work.
It is
therefore both good practice and good portability advice to always
use @code{close} on your files when you are done with them.
In fact, if you are using a lot of pipes, it is essential that
you close commands when done. For example, consider something like this:

@example
@{
    @dots{}
    command = ("grep " $1 " /some/file | my_prog -q " $3)
    while ((command | getline) > 0) @{
        @var{process output of} command
    @}
    # need close(command) here
@}
@end example

This example creates a new pipeline based on data in @emph{each} record.
Without the call to @code{close} indicated in the comment, @command{awk}
creates child processes to run the commands, until it eventually
runs out of file descriptors for more pipelines.

Even though each command has finished (as indicated by the end-of-file
return status from @code{getline}), the child process is not
terminated;@footnote{The technical terminology is rather morbid.
The finished child is called a ``zombie,'' and cleaning up after
it is referred to as ``reaping.''}
@c Good old UNIX: give the marketing guys fits, that's the ticket
more importantly, the file descriptor for the pipe
is not closed and released until @code{close} is called or
@command{awk} exits.

@code{close} will silently do nothing if given an argument that
does not represent a file, pipe or coprocess that was opened with
a redirection.

Note also that @samp{close(FILENAME)} has no
``magic'' effects on the implicit loop that reads through the
files named on the command line. It is, more likely, a close
of a file that was never opened, so @command{awk} silently
does nothing.
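
The advice above about closing each command once its output has been
consumed can be sketched as follows. This is an illustration only: the
function name @code{upcase_each} is made up, the per-record command
(@command{echo} piped through @command{tr}) merely stands in for real work,
and each input record is assumed to be a plain word with no shell
metacharacters.

```shell
# upcase_each: hypothetical helper that runs one short-lived command
# per input record and closes it before moving on.
upcase_each() {
    printf 'one\ntwo\nthree\n' | awk '{
        # Build a command from the current record.
        command = "echo " $1 " | tr a-z A-Z"
        # Read everything the command produces.
        while ((command | getline line) > 0)
            print line
        # Release the pipe before the next record is processed.
        close(command)
    }'
}

upcase_each
```

With the @code{close} call in place, at most one pipeline is open at any
time, no matter how many records are read.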

@c comma is part of tertiary
@cindex @code{|} (vertical bar), @code{|&} operator (I/O), pipes, closing
When using the @samp{|&} operator to communicate with a coprocess,
it is occasionally useful to be able to close one end of the two-way
pipe without closing the other.
This is done by supplying a second argument to @code{close}.
As in any other call to @code{close},
the first argument is the name of the command or special file used
to start the coprocess.
The second argument should be a string, with either of the values
@code{"to"} or @code{"from"}. Case does not matter.
As this is an advanced feature, a more complete discussion,
with an example, is delayed until
@ref{Two-way I/O}.

@c fakenode --- for prepinfo
@subheading Advanced Notes: Using @code{close}'s Return Value
@cindex advanced features, @code{close} function
@cindex dark corner, @code{close} function
@cindex @code{close} function, return values
@c comma does NOT start secondary
@cindex return values, @code{close} function
@cindex differences in @command{awk} and @command{gawk}, @code{close} function
@cindex Unix @command{awk}, @code{close} function and

In many versions of Unix @command{awk}, the @code{close} function
is actually a statement. It is a syntax error to try to use the return
value from @code{close}:
@value{DARKCORNER}

@example
command = "@dots{}"
command | getline info
retval = close(command)  # syntax error in most Unix awks
@end example

@command{gawk} treats @code{close} as a function.
The return value is @minus{}1 if the argument names something
that was never opened with a redirection, or if there is
a system problem closing the file or process.
In these cases, @command{gawk} sets the built-in variable
@code{ERRNO} to a string describing the problem.
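
To see what a particular @command{awk} reports in this situation, a
program can simply print the value returned by @code{close}. The sketch
below assumes an implementation in which @code{close} may be used as a
function; the function name @code{close_unopened} and the string passed to
@code{close} are hypothetical. No expected value is shown because, apart
from the @minus{}1 that @command{gawk} documents, implementations differ.

```shell
# close_unopened: hypothetical helper that prints the value close()
# returns for a name that was never opened by the program.
close_unopened() {
    awk 'BEGIN {
        rc = close("never-opened-by-this-program")
        # gawk documents -1 for this case; other awks may differ.
        print "close returned", rc
    }'
}

close_unopened
```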

In @command{gawk},
when closing a pipe or coprocess,
the return value is the exit status of the command.@footnote{
This is a full 16-bit value as returned by the @code{wait}
system call. See the system manual pages for information on
how to decode this value.}
Otherwise, it is the return value from the system's @code{close} or
@code{fclose} C functions when closing input or output
files, respectively.
This value is zero if the close succeeds, or @minus{}1 if
it fails.

The POSIX standard is very vague; it says that @code{close}
returns zero on success and non-zero otherwise. In general,
different implementations vary in what they report when closing
pipes; thus the return value cannot be used portably.
@value{DARKCORNER}

@ignore
@c 4/27/2003: Commenting this out for now, given the above
@c return of 16-bit value
The return value for closing a pipeline is particularly useful.
It allows you to get the output from a command as well as its
exit status.
@c 8/21/2002, FIXME: Maybe the code and this doc should be adjusted to
@c create values indicating death-by-signal? Sigh.

@cindex pipes, closing
@c comma does NOT start tertiary
@cindex POSIX @command{awk}, pipes, closing
For POSIX-compliant systems,
if the exit status is a number above 128, then the program
was terminated by a signal. Subtract 128 to get the signal number:

@example
exit_val = close(command)
if (exit_val > 128)
    print command, "died with signal", exit_val - 128
else
    print command, "exited with code", exit_val
@end example

Currently, in @command{gawk}, this only works for commands
piping into @code{getline}. For commands piped into
from @code{print} or @code{printf}, the
return value from @code{close} is that of the library's
@code{pclose} function.
@end ignore
@c ENDOFRANGE ifc
@c ENDOFRANGE ofc
@c ENDOFRANGE pc
@c ENDOFRANGE cc
@c ENDOFRANGE prnt

@node Expressions
@chapter Expressions
@c STARTOFRANGE exps
@cindex expressions

Expressions are the basic building blocks of @command{awk} patterns
and actions. An expression evaluates to a value that you can print, test,
or pass to a function. Additionally, an expression
can assign a new value to a variable or a field by using an assignment operator.

An expression can serve as a pattern or action statement on its own.
Most other kinds of
statements contain one or more expressions that specify the data on which to
operate. As in other languages, expressions in @command{awk} include
variables, array references, constants, and function calls, as well as
combinations of these with various operators.

@menu
* Constants::                   String, numeric and regexp constants.
* Using Constant Regexps::      When and how to use a regexp constant.
* Variables::                   Variables give names to values for later use.
* Conversion::                  The conversion of strings to numbers and vice
                                versa.
* Arithmetic Ops::              Arithmetic operations (@samp{+}, @samp{-},
                                etc.)
* Concatenation::               Concatenating strings.
* Assignment Ops::              Changing the value of a variable or a field.
* Increment Ops::               Incrementing the numeric value of a variable.
* Truth Values::                What is ``true'' and what is ``false''.
* Typing and Comparison::       How variables acquire types and how this
                                affects comparison of numbers and strings with
                                @samp{<}, etc.
* Boolean Ops::                 Combining comparison expressions using boolean
                                operators @samp{||} (``or''), @samp{&&}
                                (``and'') and @samp{!} (``not'').
* Conditional Exp::             Conditional expressions select between two
                                subexpressions under control of a third
                                subexpression.
* Function Calls::              A function call is an expression.
* Precedence::                  How various operators nest.
@end menu

@node Constants
@section Constant Expressions
@cindex constants, types of

The simplest type of expression is the @dfn{constant}, which always has
the same value. There are three types of constants: numeric,
string, and regular expression.

Each is used in the appropriate context when you need a data
value that isn't going to change. Numeric constants can
have different forms, but are stored identically internally.

@menu
* Scalar Constants::            Numeric and string constants.
* Nondecimal-numbers::          What are octal and hex numbers.
* Regexp Constants::            Regular Expression constants.
@end menu

@node Scalar Constants
@subsection Numeric and String Constants

@cindex numeric, constants
A @dfn{numeric constant} stands for a number. This number can be an
integer, a decimal fraction, or a number in scientific (exponential)
notation.@footnote{The internal representation of all numbers,
including integers, uses double-precision
floating-point numbers.
On most modern systems, these are in IEEE 754 standard format.}
Here are some examples of numeric constants that all
have the same value:

@example
105
1.05e+2
1050e-1
@end example

@cindex string constants
A string constant consists of a sequence of characters enclosed in
double-quotation marks. For example:

@example
"parrot"
@end example

@noindent
@cindex differences in @command{awk} and @command{gawk}, strings
@cindex strings, length of
represents the string whose contents are @samp{parrot}. Strings in
@command{gawk} can be of any length, and they can contain any of the possible
eight-bit characters, including ASCII @sc{nul} (character code zero).
Other @command{awk}
implementations may have difficulty with some character codes.
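
The claim that the three numeric constants shown earlier have the same
value can be checked directly. The following is a minimal sketch; the
shell function name @code{same_value} and the variable names are invented
for illustration.

```shell
# same_value: hypothetical helper comparing the three spellings of 105.
same_value() {
    awk 'BEGIN {
        # Each comparison yields 1 when the two constants are equal.
        a = (105 == 1.05e+2)
        b = (105 == 1050e-1)
        print a, b
    }'
}

same_value
```

Both comparisons yield 1, because each spelling is converted to the same
internal number before the comparison takes place.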
7335 7336@node Nondecimal-numbers 7337@subsection Octal and Hexadecimal Numbers 7338@cindex octal numbers 7339@cindex hexadecimal numbers 7340@cindex numbers, octal 7341@cindex numbers, hexadecimal 7342 7343In @command{awk}, all numbers are in decimal; i.e., base 10. Many other 7344programming languages allow you to specify numbers in other bases, often 7345octal (base 8) and hexadecimal (base 16). 7346In octal, the numbers go 0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, etc. 7347Just as @samp{11}, in decimal, is 1 times 10 plus 1, so 7348@samp{11}, in octal, is 1 times 8, plus 1. This equals 9 in decimal. 7349In hexadecimal, there are 16 digits. Since the everyday decimal 7350number system only has ten digits (@samp{0}--@samp{9}), the letters 7351@samp{a} through @samp{f} are used to represent the rest. 7352(Case in the letters is usually irrelevant; hexadecimal @samp{a} and @samp{A} 7353have the same value.) 7354Thus, @samp{11}, in 7355hexadecimal, is 1 times 16 plus 1, which equals 17 in decimal. 7356 7357Just by looking at plain @samp{11}, you can't tell what base it's in. 7358So, in C, C++, and other languages derived from C, 7359@c such as PERL, but we won't mention that.... 7360there is a special notation to help signify the base. 7361Octal numbers start with a leading @samp{0}, 7362and hexadecimal numbers start with a leading @samp{0x} or @samp{0X}: 7363 7364@table @code 7365@item 11 7366Decimal value 11. 7367 7368@item 011 7369Octal 11, decimal value 9. 7370 7371@item 0x11 7372Hexadecimal 11, decimal value 17. 7373@end table 7374 7375This example shows the difference: 7376 7377@example 7378$ gawk 'BEGIN @{ printf "%d, %d, %d\n", 011, 11, 0x11 @}' 7379@print{} 9, 11, 17 7380@end example 7381 7382Being able to use octal and hexadecimal constants in your programs is most 7383useful when working with data that cannot be represented conveniently as 7384characters or as regular numbers, such as binary data of various sorts. 
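
The base arithmetic described earlier can also be checked in the other
direction.  This sketch relies only on the @samp{%o} and @samp{%x}
conversions of @code{printf}, so it does not depend on @command{gawk}'s
nondecimal source constants:

@example
$ awk 'BEGIN @{ printf "%o %x %d\n", 9, 17, 11 @}'
@print{} 11 11 11
@end example

@noindent
Decimal 9 prints as octal @samp{11}, decimal 17 prints as hexadecimal
@samp{11}, and decimal 11 prints unchanged.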
7385 7386@cindex @command{gawk}, octal numbers and 7387@cindex @command{gawk}, hexadecimal numbers and 7388@command{gawk} allows the use of octal and hexadecimal 7389constants in your program text. However, such numbers in the input data 7390are not treated differently; doing so by default would break old 7391programs. 7392(If you really need to do this, use the @option{--non-decimal-data} 7393command-line option; 7394@pxref{Nondecimal Data}.) 7395If you have octal or hexadecimal data, 7396you can use the @code{strtonum} function 7397(@pxref{String Functions}) 7398to convert the data into a number. 7399Most of the time, you will want to use octal or hexadecimal constants 7400when working with the built-in bit manipulation functions; 7401see @ref{Bitwise Functions}, 7402for more information. 7403 7404Unlike some early C implementations, @samp{8} and @samp{9} are not valid 7405in octal constants; e.g., @command{gawk} treats @samp{018} as decimal 18: 7406 7407@example 7408$ gawk 'BEGIN @{ print "021 is", 021 ; print 018 @}' 7409@print{} 021 is 17 7410@print{} 18 7411@end example 7412 7413@cindex compatibility mode (@command{gawk}), octal numbers 7414@cindex compatibility mode (@command{gawk}), hexadecimal numbers 7415Octal and hexadecimal source code constants are a @command{gawk} extension. 7416If @command{gawk} is in compatibility mode 7417(@pxref{Options}), 7418they are not available. 7419 7420@c fakenode --- for prepinfo 7421@subheading Advanced Notes: A Constant's Base Does Not Affect Its Value 7422@c comma before values does NOT start tertiary 7423@cindex advanced features, constants, values of 7424 7425Once a numeric constant has 7426been converted internally into a number, 7427@command{gawk} no longer remembers 7428what the original form of the constant was; the internal value is 7429always used. 
This has particular consequences for conversion of 7430numbers to strings: 7431 7432@example 7433$ gawk 'BEGIN @{ printf "0x11 is <%s>\n", 0x11 @}' 7434@print{} 0x11 is <17> 7435@end example 7436 7437@node Regexp Constants 7438@subsection Regular Expression Constants 7439 7440@c STARTOFRANGE rec 7441@cindex regexp constants 7442@cindex @code{~} (tilde), @code{~} operator 7443@cindex tilde (@code{~}), @code{~} operator 7444@cindex @code{!} (exclamation point), @code{!~} operator 7445@cindex exclamation point (@code{!}), @code{!~} operator 7446A regexp constant is a regular expression description enclosed in 7447slashes, such as @code{@w{/^beginning and end$/}}. Most regexps used in 7448@command{awk} programs are constant, but the @samp{~} and @samp{!~} 7449matching operators can also match computed or ``dynamic'' regexps 7450(which are just ordinary strings or variables that contain a regexp). 7451@c ENDOFRANGE cnst 7452 7453@node Using Constant Regexps 7454@section Using Regular Expression Constants 7455 7456@cindex dark corner, regexp constants 7457When used on the righthand side of the @samp{~} or @samp{!~} 7458operators, a regexp constant merely stands for the regexp that is to be 7459matched. 7460However, regexp constants (such as @code{/foo/}) may be used like simple expressions. 7461When a 7462regexp constant appears by itself, it has the same meaning as if it appeared 7463in a pattern, i.e., @samp{($0 ~ /foo/)} 7464@value{DARKCORNER} 7465@xref{Expression Patterns}. 7466This means that the following two code segments: 7467 7468@example 7469if ($0 ~ /barfly/ || $0 ~ /camelot/) 7470 print "found" 7471@end example 7472 7473@noindent 7474and: 7475 7476@example 7477if (/barfly/ || /camelot/) 7478 print "found" 7479@end example 7480 7481@noindent 7482are exactly equivalent. 
7483One rather bizarre consequence of this rule is that the following 7484Boolean expression is valid, but does not do what the user probably 7485intended: 7486 7487@example 7488# note that /foo/ is on the left of the ~ 7489if (/foo/ ~ $1) print "found foo" 7490@end example 7491 7492@c @cindex automatic warnings 7493@c @cindex warnings, automatic 7494@cindex @command{gawk}, regexp constants and 7495@cindex regexp constants, in @command{gawk} 7496@noindent 7497This code is ``obviously'' testing @code{$1} for a match against the regexp 7498@code{/foo/}. But in fact, the expression @samp{/foo/ ~ $1} actually means 7499@samp{($0 ~ /foo/) ~ $1}. In other words, first match the input record 7500against the regexp @code{/foo/}. The result is either zero or one, 7501depending upon the success or failure of the match. That result 7502is then matched against the first field in the record. 7503Because it is unlikely that you would ever really want to make this kind of 7504test, @command{gawk} issues a warning when it sees this construct in 7505a program. 7506Another consequence of this rule is that the assignment statement: 7507 7508@example 7509matches = /foo/ 7510@end example 7511 7512@noindent 7513assigns either zero or one to the variable @code{matches}, depending 7514upon the contents of the current input record. 7515This feature of the language has never been well documented until the 7516POSIX specification. 7517 7518@cindex differences in @command{awk} and @command{gawk}, regexp constants 7519@cindex dark corner, regexp constants, as arguments to user-defined functions 7520@cindex @code{gensub} function (@command{gawk}) 7521@cindex @code{sub} function 7522@cindex @code{gsub} function 7523Constant regular expressions are also used as the first argument for 7524the @code{gensub}, @code{sub}, and @code{gsub} functions, and as the 7525second argument of the @code{match} function 7526(@pxref{String Functions}). 
7527Modern implementations of @command{awk}, including @command{gawk}, allow 7528the third argument of @code{split} to be a regexp constant, but some 7529older implementations do not. 7530@value{DARKCORNER} 7531This can lead to confusion when attempting to use regexp constants 7532as arguments to user-defined functions 7533(@pxref{User-defined}). 7534For example: 7535 7536@example 7537function mysub(pat, repl, str, global) 7538@{ 7539 if (global) 7540 gsub(pat, repl, str) 7541 else 7542 sub(pat, repl, str) 7543 return str 7544@} 7545 7546@{ 7547 @dots{} 7548 text = "hi! hi yourself!" 7549 mysub(/hi/, "howdy", text, 1) 7550 @dots{} 7551@} 7552@end example 7553 7554@c @cindex automatic warnings 7555@c @cindex warnings, automatic 7556In this example, the programmer wants to pass a regexp constant to the 7557user-defined function @code{mysub}, which in turn passes it on to 7558either @code{sub} or @code{gsub}. However, what really happens is that 7559the @code{pat} parameter is either one or zero, depending upon whether 7560or not @code{$0} matches @code{/hi/}. 7561@command{gawk} issues a warning when it sees a regexp constant used as 7562a parameter to a user-defined function, since passing a truth value in 7563this way is probably not what was intended. 7564@c ENDOFRANGE rec 7565 7566@node Variables 7567@section Variables 7568 7569@cindex variables, user-defined 7570@cindex user-defined, variables 7571Variables are ways of storing values at one point in your program for 7572use later in another part of your program. They can be manipulated 7573entirely within the program text, and they can also be assigned values 7574on the @command{awk} command line. 7575 7576@menu 7577* Using Variables:: Using variables in your programs. 7578* Assignment Options:: Setting variables on the command-line and a 7579 summary of command-line syntax. This is an 7580 advanced method of input. 
7581@end menu 7582 7583@node Using Variables 7584@subsection Using Variables in a Program 7585 7586Variables let you give names to values and refer to them later. Variables 7587have already been used in many of the examples. The name of a variable 7588must be a sequence of letters, digits, or underscores, and it may not begin 7589with a digit. Case is significant in variable names; @code{a} and @code{A} 7590are distinct variables. 7591 7592A variable name is a valid expression by itself; it represents the 7593variable's current value. Variables are given new values with 7594@dfn{assignment operators}, @dfn{increment operators}, and 7595@dfn{decrement operators}. 7596@xref{Assignment Ops}. 7597@c NEXT ED: Can also be changed by sub, gsub, split 7598 7599@cindex variables, built-in 7600@cindex variables, initializing 7601A few variables have special built-in meanings, such as @code{FS} (the 7602field separator), and @code{NF} (the number of fields in the current input 7603record). @xref{Built-in Variables}, for a list of the built-in variables. 7604These built-in variables can be used and assigned just like all other 7605variables, but their values are also used or changed automatically by 7606@command{awk}. All built-in variables' names are entirely uppercase. 7607 7608Variables in @command{awk} can be assigned either numeric or string values. 7609The kind of value a variable holds can change over the life of a program. 7610By default, variables are initialized to the empty string, which 7611is zero if converted to a number. There is no need to 7612``initialize'' each variable explicitly in @command{awk}, 7613which is what you would do in C and in most other traditional languages. 
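
The following sketch illustrates two of the points above: a variable
that has never been assigned acts as the empty string and as zero, and
names that differ only in case are separate variables.  The names
@code{x}, @code{a}, and @code{A} are chosen purely for illustration:

@example
$ awk 'BEGIN @{ print "[" x "]", x + 0, length(x)
>              a = 1; A = "one"; print a, A @}'
@print{} [] 0 0
@print{} 1 one
@end example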
7614 7615@node Assignment Options 7616@subsection Assigning Variables on the Command Line 7617@cindex variables, assigning on command line 7618@c comma before assigning does NOT start tertiary 7619@cindex command line, variables, assigning on 7620 7621Any @command{awk} variable can be set by including a @dfn{variable assignment} 7622among the arguments on the command line when @command{awk} is invoked 7623(@pxref{Other Arguments}). 7624Such an assignment has the following form: 7625 7626@example 7627@var{variable}=@var{text} 7628@end example 7629 7630@c comma before assigning does NOT start tertiary 7631@cindex @code{-v} option, variables, assigning 7632@noindent 7633With it, a variable is set either at the beginning of the 7634@command{awk} run or in between input files. 7635When the assignment is preceded with the @option{-v} option, 7636as in the following: 7637 7638@example 7639-v @var{variable}=@var{text} 7640@end example 7641 7642@noindent 7643the variable is set at the very beginning, even before the 7644@code{BEGIN} rules are run. The @option{-v} option and its assignment 7645must precede all the @value{FN} arguments, as well as the program text. 7646(@xref{Options}, for more information about 7647the @option{-v} option.) 7648Otherwise, the variable assignment is performed at a time determined by 7649its position among the input file arguments---after the processing of the 7650preceding input file argument. For example: 7651 7652@example 7653awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list 7654@end example 7655 7656@noindent 7657prints the value of field number @code{n} for all input records. Before 7658the first file is read, the command line sets the variable @code{n} 7659equal to four. This causes the fourth field to be printed in lines from 7660the file @file{inventory-shipped}. 
After the first file has finished, 7661but before the second file is started, @code{n} is set to two, so that the 7662second field is printed in lines from @file{BBS-list}: 7663 7664@example 7665$ awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list 7666@print{} 15 7667@print{} 24 7668@dots{} 7669@print{} 555-5553 7670@print{} 555-3412 7671@dots{} 7672@end example 7673 7674@cindex dark corner, command-line arguments 7675Command-line arguments are made available for explicit examination by 7676the @command{awk} program in the @code{ARGV} array 7677(@pxref{ARGC and ARGV}). 7678@command{awk} processes the values of command-line assignments for escape 7679sequences 7680(@pxref{Escape Sequences}). 7681@value{DARKCORNER} 7682 7683@node Conversion 7684@section Conversion of Strings and Numbers 7685 7686@cindex converting, strings to numbers 7687@cindex strings, converting 7688@cindex numbers, converting 7689@cindex converting, numbers 7690Strings are converted to numbers and numbers are converted to strings, if the context 7691of the @command{awk} program demands it. For example, if the value of 7692either @code{foo} or @code{bar} in the expression @samp{foo + bar} 7693happens to be a string, it is converted to a number before the addition 7694is performed. If numeric values appear in string concatenation, they 7695are converted to strings. Consider the following: 7696 7697@example 7698two = 2; three = 3 7699print (two three) + 4 7700@end example 7701 7702@noindent 7703This prints the (numeric) value 27. The numeric values of 7704the variables @code{two} and @code{three} are converted to strings and 7705concatenated together. The resulting string is converted back to the 7706number 23, to which 4 is then added. 7707 7708@cindex null strings, converting numbers to strings 7709@cindex type conversion 7710If, for some reason, you need to force a number to be converted to a 7711string, concatenate the empty string, @code{""}, with that number. 
7712To force a string to be converted to a number, add zero to that string. 7713A string is converted to a number by interpreting any numeric prefix 7714of the string as numerals: 7715@code{"2.5"} converts to 2.5, @code{"1e3"} converts to 1000, and @code{"25fix"} 7716has a numeric value of 25. 7717Strings that can't be interpreted as valid numbers convert to zero. 7718 7719@cindex @code{CONVFMT} variable 7720The exact manner in which numbers are converted into strings is controlled 7721by the @command{awk} built-in variable @code{CONVFMT} (@pxref{Built-in Variables}). 7722Numbers are converted using the @code{sprintf} function 7723with @code{CONVFMT} as the format 7724specifier 7725(@pxref{String Functions}). 7726 7727@code{CONVFMT}'s default value is @code{"%.6g"}, which formats a value with 7728at most six significant digits. For some applications, you might want to 7729change it to specify more precision. 7730On most modern machines, 773117 digits is enough to capture a floating-point number's 7732value exactly, 7733most of the time.@footnote{Pathological cases can require up to 7734752 digits (!), but we doubt that you need to worry about this.} 7735 7736@cindex dark corner, @code{CONVFMT} variable 7737Strange results can occur if you set @code{CONVFMT} to a string that doesn't 7738tell @code{sprintf} how to format floating-point numbers in a useful way. 7739For example, if you forget the @samp{%} in the format, @command{awk} converts 7740all numbers to the same constant string. 7741As a special case, if a number is an integer, then the result of converting 7742it to a string is @emph{always} an integer, no matter what the value of 7743@code{CONVFMT} may be. Given the following code fragment: 7744 7745@example 7746CONVFMT = "%2.2f" 7747a = 12 7748b = a "" 7749@end example 7750 7751@noindent 7752@code{b} has the value @code{"12"}, not @code{"12.00"}.
7753@value{DARKCORNER} 7754 7755@cindex POSIX @command{awk}, @code{OFMT} variable and 7756@cindex @code{OFMT} variable 7757@cindex portability, new @command{awk} vs. old @command{awk} 7758@cindex @command{awk}, new vs. old, @code{OFMT} variable 7759Prior to the POSIX standard, @command{awk} used the value 7760of @code{OFMT} for converting numbers to strings. @code{OFMT} 7761specifies the output format to use when printing numbers with @code{print}. 7762@code{CONVFMT} was introduced in order to separate the semantics of 7763conversion from the semantics of printing. Both @code{CONVFMT} and 7764@code{OFMT} have the same default value: @code{"%.6g"}. In the vast majority 7765of cases, old @command{awk} programs do not change their behavior. 7766However, these semantics for @code{OFMT} are something to keep in mind if you must 7767port your new style program to older implementations of @command{awk}. 7768We recommend 7769that instead of changing your programs, just port @command{gawk} itself. 7770@xref{Print}, 7771for more information on the @code{print} statement. 7772 7773Finally, once again, where you are can matter when it comes to 7774converting between numbers and strings. In 7775@ref{Locales}, we mentioned that the 7776local character set and language (the locale) can affect how @command{gawk} matches 7777characters. The locale also affects numeric formats. In particular, for @command{awk} 7778programs, it affects the decimal point character. The @code{"C"} locale, and most 7779English-language locales, use the period character (@samp{.}) as the decimal point. 7780However, many (if not most) European and non-English locales use the comma (@samp{,}) 7781as the decimal point character. 7782 7783The POSIX standard says that @command{awk} always uses the period as the decimal 7784point when reading the @command{awk} program source code, and for command-line 7785variable assignments (@pxref{Other Arguments}). 
7786However, when interpreting input data, for @code{print} and @code{printf} output, 7787and for number to string conversion, the local decimal point character is used. 7788As of @value{PVERSION} 3.1.3, @command{gawk} fully complies with this aspect 7789of the standard. Here are some examples indicating the difference in behavior, 7790on a GNU/Linux system: 7791 7792@example 7793$ gawk 'BEGIN @{ printf "%g\n", 3.1415927 @}' 7794@print{} 3.14159 7795$ LC_ALL=en_DK gawk 'BEGIN @{ printf "%g\n", 3.1415927 @}' 7796@print{} 3,14159 7797$ echo 4,321 | gawk '@{ print $1 + 1 @}' 7798@print{} 5 7799$ echo 4,321 | LC_ALL=en_DK gawk '@{ print $1 + 1 @}' 7800@print{} 5,321 7801@end example 7802 7803@noindent 7804The @samp{en_DK} locale is for English in Denmark, where the comma acts as 7805the decimal point separator. In the normal @code{"C"} locale, @command{gawk} 7806treats @samp{4,321} as @samp{4}, while in the Danish locale, it's treated 7807as the full number, @samp{4.321}. 7808 7809@node Arithmetic Ops 7810@section Arithmetic Operators 7811@cindex arithmetic operators 7812@cindex operators, arithmetic 7813@c @cindex addition 7814@c @cindex subtraction 7815@c @cindex multiplication 7816@c @cindex division 7817@c @cindex remainder 7818@c @cindex quotient 7819@c @cindex exponentiation 7820 7821The @command{awk} language uses the common arithmetic operators when 7822evaluating expressions. All of these arithmetic operators follow normal 7823precedence rules and work as you would expect them to. 
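
A few one-line checks may help make the precedence rules concrete.
This is only an illustrative sketch; it shows that exponentiation
binds more tightly than unary minus, that exponentiation groups from
right to left, and that division is not truncated to an integer:

@example
$ awk 'BEGIN @{ print -2 ^ 2, (-2) ^ 2, 2 ^ 3 ^ 2, 7 / 2 @}'
@print{} -4 4 512 3.5
@end example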
7824 7825The following example uses a file named @file{grades}, which contains 7826a list of student names as well as three test scores per student (it's 7827a small class): 7828 7829@example 7830Pat 100 97 58 7831Sandy 84 72 93 7832Chris 72 92 89 7833@end example 7834 7835@noindent 7836This program takes the file @file{grades} and prints the average 7837of the scores: 7838 7839@example 7840$ awk '@{ sum = $2 + $3 + $4 ; avg = sum / 3 7841> print $1, avg @}' grades 7842@print{} Pat 85 7843@print{} Sandy 83 7844@print{} Chris 84.3333 7845@end example 7846 7847The following list provides the arithmetic operators in @command{awk}. Exponentiation binds more tightly than unary minus and unary plus; otherwise, the operators appear in order from 7848the highest precedence to the lowest: 7849 7850@table @code 7851@item - @var{x} 7852Negation. 7853 7854@item + @var{x} 7855Unary plus; the expression is converted to a number. 7856 7857@cindex POSIX @command{awk}, arithmetic operators and 7858@item @var{x} ^ @var{y} 7859@itemx @var{x} ** @var{y} 7860Exponentiation; @var{x} raised to the @var{y} power. @samp{2 ^ 3} has 7861the value eight; the character sequence @samp{**} is equivalent to 7862@samp{^}. 7863 7864@item @var{x} * @var{y} 7865Multiplication. 7866 7867@cindex troubleshooting, division 7868@cindex division 7869@item @var{x} / @var{y} 7870Division; because all numbers in @command{awk} are floating-point 7871numbers, the result is @emph{not} rounded to an integer---@samp{3 / 4} has 7872the value 0.75. (It is a common mistake, especially for C programmers, 7873to forget that @emph{all} numbers in @command{awk} are floating-point, 7874and that division of integer-looking constants produces a real number, 7875not an integer.) 7876 7877@item @var{x} % @var{y} 7878Remainder; further discussion is provided in the text, just 7879after this list. 7880 7881@item @var{x} + @var{y} 7882Addition. 7883 7884@item @var{x} - @var{y} 7885Subtraction.
7886@end table 7887 7888Unary plus and minus have the same precedence, 7889the multiplication operators all have the same precedence, and 7890addition and subtraction have the same precedence. 7891 7892@cindex differences in @command{awk} and @command{gawk}, trunc-mod operation 7893@cindex trunc-mod operation 7894When computing the remainder of @code{@var{x} % @var{y}}, 7895the quotient is rounded toward zero to an integer and 7896multiplied by @var{y}. This result is subtracted from @var{x}; 7897this operation is sometimes known as ``trunc-mod.'' The following 7898relation always holds: 7899 7900@example 7901b * int(a / b) + (a % b) == a 7902@end example 7903 7904One possibly undesirable effect of this definition of remainder is that 7905@code{@var{x} % @var{y}} is negative if @var{x} is negative. Thus: 7906 7907@example 7908-17 % 8 = -1 7909@end example 7910 7911In other @command{awk} implementations, the signedness of the remainder 7912may be machine-dependent. 7913@c !!! what does posix say? 7914 7915@cindex portability, @code{**} operator and 7916@cindex @code{*} (asterisk), @code{**} operator 7917@cindex asterisk (@code{*}), @code{**} operator 7918@strong{Note:} 7919The POSIX standard only specifies the use of @samp{^} 7920for exponentiation. 7921For maximum portability, do not use the @samp{**} operator. 7922 7923@node Concatenation 7924@section String Concatenation 7925@cindex Kernighan, Brian 7926@quotation 7927@i{It seemed like a good idea at the time.}@* 7928Brian Kernighan 7929@end quotation 7930 7931@cindex string operators 7932@cindex operators, string 7933@cindex concatenating 7934There is only one string operation: concatenation. It does not have a 7935specific operator to represent it. Instead, concatenation is performed by 7936writing expressions next to one another, with no operator. 
For example: 7937 7938@example 7939$ awk '@{ print "Field number one: " $1 @}' BBS-list 7940@print{} Field number one: aardvark 7941@print{} Field number one: alpo-net 7942@dots{} 7943@end example 7944 7945Without the space in the string constant after the @samp{:}, the line 7946runs together. For example: 7947 7948@example 7949$ awk '@{ print "Field number one:" $1 @}' BBS-list 7950@print{} Field number one:aardvark 7951@print{} Field number one:alpo-net 7952@dots{} 7953@end example 7954 7955@cindex troubleshooting, string concatenation 7956Because string concatenation does not have an explicit operator, it is 7957often necessary to ensure that it happens at the right time by using 7958parentheses to enclose the items to concatenate. For example, the 7959following code fragment does not concatenate @code{file} and @code{name} 7960as you might expect: 7961 7962@example 7963file = "file" 7964name = "name" 7965print "something meaningful" > file name 7966@end example 7967 7968@noindent 7969It is necessary to use the following: 7970 7971@example 7972print "something meaningful" > (file name) 7973@end example 7974 7975@cindex order of evaluation, concatenation 7976@cindex evaluation order, concatenation 7977@cindex side effects 7978Parentheses should be used around concatenation in all but the 7979most common contexts, such as on the righthand side of @samp{=}. 7980Be careful about the kinds of expressions used in string concatenation. 7981In particular, the order of evaluation of expressions used for concatenation 7982is undefined in the @command{awk} language. Consider this example: 7983 7984@example 7985BEGIN @{ 7986 a = "don't" 7987 print (a " " (a = "panic")) 7988@} 7989@end example 7990 7991@noindent 7992It is not defined whether the assignment to @code{a} happens 7993before or after the value of @code{a} is retrieved for producing the 7994concatenated value. The result could be either @samp{don't panic}, 7995or @samp{panic panic}.
7996@c see test/nasty.awk for a worse example 7997The precedence of concatenation, when mixed with other operators, is often 7998counter-intuitive. Consider this example: 7999 8000@ignore 8001> To: bug-gnu-utils@@gnu.org 8002> CC: arnold@gnu.org 8003> Subject: gawk 3.0.4 bug with {print -12 " " -24} 8004> From: Russell Schulz <Russell_Schulz@locutus.ofB.ORG> 8005> Date: Tue, 8 Feb 2000 19:56:08 -0700 8006> 8007> gawk 3.0.4 on NT gives me: 8008> 8009> prompt> cat bad.awk 8010> BEGIN { print -12 " " -24; } 8011> 8012> prompt> gawk -f bad.awk 8013> -12-24 8014> 8015> when I would expect 8016> 8017> -12 -24 8018> 8019> I have not investigated the source, or other implementations. The 8020> bug is there on my NT and DOS versions 2.15.6 . 8021@end ignore 8022 8023@example 8024$ awk 'BEGIN @{ print -12 " " -24 @}' 8025@print{} -12-24 8026@end example 8027 8028This ``obviously'' is concatenating @minus{}12, a space, and @minus{}24. 8029But where did the space disappear to? 8030The answer lies in the combination of operator precedences and 8031@command{awk}'s automatic conversion rules. To get the desired result, 8032write the program in the following manner: 8033 8034@example 8035$ awk 'BEGIN @{ print -12 " " (-24) @}' 8036@print{} -12 -24 8037@end example 8038 8039This forces @command{awk} to treat the @samp{-} on the @samp{-24} as unary. 8040Otherwise, it's parsed as follows: 8041 8042@display 8043 @minus{}12 (@code{"@ "} @minus{} 24) 8044@result{} @minus{}12 (0 @minus{} 24) 8045@result{} @minus{}12 (@minus{}24) 8046@result{} @minus{}12@minus{}24 8047@end display 8048 8049As mentioned earlier, 8050when doing concatenation, @emph{parenthesize}. Otherwise, 8051you're never quite sure what you'll get. 
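
Another way to sidestep the problem, sketched here with the
illustrative names @code{a}, @code{b}, and @code{s}, is to store the
values in variables first.  Because no @samp{-} token then appears next
to the string constant, there is nothing for @command{awk} to parse as
a subtraction:

@example
$ awk 'BEGIN @{ a = -12; b = -24; s = a " " b; print s @}'
@print{} -12 -24
@end example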
8052 8053@node Assignment Ops 8054@section Assignment Expressions 8055@c STARTOFRANGE asop 8056@cindex assignment operators 8057@c STARTOFRANGE opas 8058@cindex operators, assignment 8059@c STARTOFRANGE exas 8060@cindex expressions, assignment 8061@cindex @code{=} (equals sign), @code{=} operator 8062@cindex equals sign (@code{=}), @code{=} operator 8063An @dfn{assignment} is an expression that stores a (usually different) 8064value into a variable. For example, let's assign the value one to the variable 8065@code{z}: 8066 8067@example 8068z = 1 8069@end example 8070 8071After this expression is executed, the variable @code{z} has the value one. 8072Whatever old value @code{z} had before the assignment is forgotten. 8073 8074Assignments can also store string values. For example, the 8075following stores 8076the value @code{"this food is good"} in the variable @code{message}: 8077 8078@example 8079thing = "food" 8080predicate = "good" 8081message = "this " thing " is " predicate 8082@end example 8083 8084@noindent 8085@cindex side effects, assignment expressions 8086This also illustrates string concatenation. 8087The @samp{=} sign is called an @dfn{assignment operator}. It is the 8088simplest assignment operator because the value of the righthand 8089operand is stored unchanged. 8090Most operators (addition, concatenation, and so on) have no effect 8091except to compute a value. If the value isn't used, there's no reason to 8092use the operator. An assignment operator is different; it does 8093produce a value, but even if you ignore it, the assignment still 8094makes itself felt through the alteration of the variable. We call this 8095a @dfn{side effect}. 
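
The difference between computing a value and causing a side effect can
be seen in a small sketch.  The variable name @code{x} is only
illustrative.  The statement @samp{x + 5} computes six and discards it,
whereas the assignment changes @code{x}:

@example
$ awk 'BEGIN @{ x = 1; x + 5; print x; x = x + 5; print x @}'
@print{} 1
@print{} 6
@end example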
8096 8097@cindex lvalues/rvalues 8098@cindex rvalues/lvalues 8099@cindex assignment operators, lvalues/rvalues 8100@cindex operators, assignment 8101The lefthand operand of an assignment need not be a variable 8102(@pxref{Variables}); it can also be a field 8103(@pxref{Changing Fields}) or 8104an array element (@pxref{Arrays}). 8105These are all called @dfn{lvalues}, 8106which means they can appear on the lefthand side of an assignment operator. 8107The righthand operand may be any expression; it produces the new value 8108that the assignment stores in the specified variable, field, or array 8109element. (Such values are called @dfn{rvalues}.) 8110 8111@cindex variables, types of 8112It is important to note that variables do @emph{not} have permanent types. 8113A variable's type is simply the type of whatever value it happens 8114to hold at the moment. In the following program fragment, the variable 8115@code{foo} has a numeric value at first, and a string value later on: 8116 8117@example 8118foo = 1 8119print foo 8120foo = "bar" 8121print foo 8122@end example 8123 8124@noindent 8125When the second assignment gives @code{foo} a string value, the fact that 8126it previously had a numeric value is forgotten. 8127 8128String values that do not begin with a digit have a numeric value of 8129zero. After executing the following code, the value of @code{foo} is five: 8130 8131@example 8132foo = "a string" 8133foo = foo + 5 8134@end example 8135 8136@noindent 8137@strong{Note:} Using a variable as a number and then later as a string 8138can be confusing and is poor programming style. The previous two examples 8139illustrate how @command{awk} works, @emph{not} how you should write your 8140programs! 8141 8142An assignment is an expression, so it has a value---the same value that 8143is assigned. Thus, @samp{z = 1} is an expression with the value one. 
8144One consequence of this is that you can write multiple assignments together, 8145such as: 8146 8147@example 8148x = y = z = 5 8149@end example 8150 8151@noindent 8152This example stores the value five in all three variables 8153(@code{x}, @code{y}, and @code{z}). 8154It does so because the 8155value of @samp{z = 5}, which is five, is stored into @code{y} and then 8156the value of @samp{y = z = 5}, which is five, is stored into @code{x}. 8157 8158Assignments may be used anywhere an expression is called for. For 8159example, it is valid to write @samp{x != (y = 1)} to set @code{y} to one, 8160and then test whether @code{x} equals one. But this style tends to make 8161programs hard to read; such nesting of assignments should be avoided, 8162except perhaps in a one-shot program. 8163 8164@cindex @code{+} (plus sign), @code{+=} operator 8165@cindex plus sign (@code{+}), @code{+=} operator 8166Aside from @samp{=}, there are several other assignment operators that 8167do arithmetic with the old value of the variable. For example, the 8168operator @samp{+=} computes a new value by adding the righthand value 8169to the old value of the variable. Thus, the following assignment adds 8170five to the value of @code{foo}: 8171 8172@example 8173foo += 5 8174@end example 8175 8176@noindent 8177This is equivalent to the following: 8178 8179@example 8180foo = foo + 5 8181@end example 8182 8183@noindent 8184Use whichever makes the meaning of your program clearer. 8185 8186There are situations where using @samp{+=} (or any assignment operator) 8187is @emph{not} the same as simply repeating the lefthand operand in the 8188righthand expression. 
For example:

@cindex Rankin, Pat
@example
# Thanks to Pat Rankin for this example
BEGIN @{
    foo[rand()] += 5
    for (x in foo)
        print x, foo[x]

    bar[rand()] = bar[rand()] + 5
    for (x in bar)
        print x, bar[x]
@}
@end example

@cindex operators, assignment, evaluation order
@cindex assignment operators, evaluation order
@noindent
The indices of @code{bar} are practically guaranteed to be different, because
@code{rand} returns different values each time it is called.
(Arrays and the @code{rand} function haven't been covered yet.
@xref{Arrays},
and see @ref{Numeric Functions}, for more information.)
This example illustrates an important fact about assignment
operators: the lefthand expression is only evaluated @emph{once}.
It is up to the implementation as to which expression is evaluated
first, the lefthand or the righthand.
Consider this example:

@example
i = 1
a[i += 2] = i + 1
@end example

@noindent
The value of @code{a[3]} could be either two or four.

Here is a table of the arithmetic assignment operators. In each
case, the righthand operand is an expression whose value is converted
to a number.

@ignore
@table @code
@item @var{lvalue} += @var{increment}
Adds @var{increment} to the value of @var{lvalue}.

@item @var{lvalue} -= @var{decrement}
Subtracts @var{decrement} from the value of @var{lvalue}.

@item @var{lvalue} *= @var{coefficient}
Multiplies the value of @var{lvalue} by @var{coefficient}.

@item @var{lvalue} /= @var{divisor}
Divides the value of @var{lvalue} by @var{divisor}.

@item @var{lvalue} %= @var{modulus}
Sets @var{lvalue} to its remainder by @var{modulus}.

@cindex @command{awk} language, POSIX version
@cindex POSIX @command{awk}
@item @var{lvalue} ^= @var{power}
@itemx @var{lvalue} **= @var{power}
Raises @var{lvalue} to the power @var{power}.
(Only the @samp{^=} operator is specified by POSIX.)
@end table
@end ignore

@cindex @code{-} (hyphen), @code{-=} operator
@cindex hyphen (@code{-}), @code{-=} operator
@cindex @code{*} (asterisk), @code{*=} operator
@cindex asterisk (@code{*}), @code{*=} operator
@cindex @code{/} (forward slash), @code{/=} operator
@cindex forward slash (@code{/}), @code{/=} operator
@cindex @code{%} (percent sign), @code{%=} operator
@cindex percent sign (@code{%}), @code{%=} operator
@cindex @code{^} (caret), @code{^=} operator
@cindex caret (@code{^}), @code{^=} operator
@cindex @code{*} (asterisk), @code{**=} operator
@cindex asterisk (@code{*}), @code{**=} operator
@multitable {@var{lvalue} *= @var{coefficient}} {Subtracts @var{decrement} from the value of @var{lvalue}.}
@item @var{lvalue} @code{+=} @var{increment} @tab Adds @var{increment} to the value of @var{lvalue}.

@item @var{lvalue} @code{-=} @var{decrement} @tab Subtracts @var{decrement} from the value of @var{lvalue}.

@item @var{lvalue} @code{*=} @var{coefficient} @tab Multiplies the value of @var{lvalue} by @var{coefficient}.

@item @var{lvalue} @code{/=} @var{divisor} @tab Divides the value of @var{lvalue} by @var{divisor}.

@item @var{lvalue} @code{%=} @var{modulus} @tab Sets @var{lvalue} to its remainder by @var{modulus}.

@cindex @command{awk} language, POSIX version
@cindex POSIX @command{awk}
@item @var{lvalue} @code{^=} @var{power} @tab
@item @var{lvalue} @code{**=} @var{power} @tab Raises @var{lvalue} to the power @var{power}.
@end multitable

@cindex POSIX @command{awk}, @code{**=} operator and
@cindex portability, @code{**=} operator and
@strong{Note:}
Only the @samp{^=} operator is specified by POSIX.
For maximum portability, do not use the @samp{**=} operator.

@c fakenode --- for prepinfo
@subheading Advanced Notes: Syntactic Ambiguities Between @samp{/=} and Regular Expressions
@cindex advanced features, regexp constants
@cindex dark corner, regexp constants, @code{/=} operator and
@cindex @code{/} (forward slash), @code{/=} operator, vs. @code{/=@dots{}/} regexp constant
@cindex forward slash (@code{/}), @code{/=} operator, vs. @code{/=@dots{}/} regexp constant
@cindex regexp constants, @code{/=@dots{}/}, @code{/=} operator and

@c derived from email from "Nelson H. F. Beebe" <beebe@math.utah.edu>
@c Date: Mon, 1 Sep 1997 13:38:35 -0600 (MDT)

@cindex dark corner
@cindex ambiguity, syntactic: @code{/=} operator vs. @code{/=@dots{}/} regexp constant
@cindex syntactic ambiguity: @code{/=} operator vs. @code{/=@dots{}/} regexp constant
@cindex @code{/=} operator vs. @code{/=@dots{}/} regexp constant
There is a syntactic ambiguity between the @samp{/=} assignment
operator and regexp constants whose first character is an @samp{=}.
@value{DARKCORNER}
This is most notable in commercial @command{awk} versions.
For example:

@example
$ awk /==/ /dev/null
@error{} awk: syntax error at source line 1
@error{}  context is
@error{}         >>> /= <<<
@error{} awk: bailing out at source line 1
@end example

@noindent
A workaround is:

@example
awk '/[=]=/' /dev/null
@end example

@command{gawk} does not have this problem,
nor do the other
freely available versions described in
@ref{Other Versions}.
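
The workaround can also be tried on real input instead of
@file{/dev/null}. The following is only a minimal sketch and is not taken
from any @command{awk} distribution: the two sample input lines and the
shell variable @code{out} are made up for illustration.

```shell
# Hypothetical two-line input; only the first line contains "==".
# The bracket expression [=] matches a literal "=", so the regexp
# constant no longer starts with the characters "/=".
out=$(printf 'a == b\nc = d\n' | awk '/[=]=/')
printf '%s\n' "$out"
```

Only the first sample line contains two adjacent equals signs, so only
that line is printed. Because the regexp no longer begins with
@samp{/=}, this spelling should sidestep the ambiguity in versions that
have it while matching the same text as @code{/==/}.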
@c ENDOFRANGE exas
@c ENDOFRANGE opas
@c ENDOFRANGE asop

@node Increment Ops
@section Increment and Decrement Operators

@c STARTOFRANGE inop
@cindex increment operators
@c STARTOFRANGE opde
@cindex operators, decrement/increment
@dfn{Increment} and @dfn{decrement operators} increase or decrease the value of
a variable by one. An assignment operator can do the same thing, so
the increment operators add no power to the @command{awk} language; however, they
are convenient abbreviations for very common operations.

@cindex side effects
@cindex @code{+} (plus sign), decrement/increment operators
@cindex plus sign (@code{+}), decrement/increment operators
@cindex side effects, decrement/increment operators
The operator used for adding one is written @samp{++}. It can be used to increment
a variable either before or after taking its value.
To pre-increment a variable @code{v}, write @samp{++v}. This adds
one to the value of @code{v}---that new value is also the value of the
expression. (The assignment expression @samp{v += 1} is completely
equivalent.)
Writing the @samp{++} after the variable specifies post-increment. This
increments the variable value just the same; the difference is that the
value of the increment expression itself is the variable's @emph{old}
value. Thus, if @code{foo} has the value four, then the expression @samp{foo++}
has the value four, but it changes the value of @code{foo} to five.
In other words, the operator returns the old value of the variable,
but with the side effect of incrementing it.

The post-increment @samp{foo++} is nearly the same as writing @samp{(foo
+= 1) - 1}. It is not perfectly equivalent because all numbers in
@command{awk} are floating-point---in floating-point, @samp{foo + 1 - 1} does
not necessarily equal @code{foo}.
But the difference is minute as
long as you stick to numbers that are fairly small (less than 10e12).

@cindex @code{$} (dollar sign), incrementing fields and arrays
@cindex dollar sign (@code{$}), incrementing fields and arrays
Fields and array elements are incremented
just like variables. (Use @samp{$(i++)} when you want to do a field reference
and a variable increment at the same time. The parentheses are necessary
because of the precedence of the field reference operator @samp{$}.)

@cindex decrement operators
The decrement operator @samp{--} works just like @samp{++}, except that
it subtracts one instead of adding it. As with @samp{++}, it can be used before
the lvalue to pre-decrement or after it to post-decrement.
Following is a summary of increment and decrement expressions:

@table @code
@cindex @code{+} (plus sign), @code{++} operator
@cindex plus sign (@code{+}), @code{++} operator
@item ++@var{lvalue}
This expression increments @var{lvalue}, and the new value becomes the
value of the expression.

@item @var{lvalue}++
This expression increments @var{lvalue}, but
the value of the expression is the @emph{old} value of @var{lvalue}.

@cindex @code{-} (hyphen), @code{--} operator
@cindex hyphen (@code{-}), @code{--} operator
@item --@var{lvalue}
This expression is
like @samp{++@var{lvalue}}, but instead of adding, it subtracts. It
decrements @var{lvalue} and delivers the value that is the result.

@item @var{lvalue}--
This expression is
like @samp{@var{lvalue}++}, but instead of adding, it subtracts. It
decrements @var{lvalue}. The value of the expression is the @emph{old}
value of @var{lvalue}.
@end table

@c fakenode --- for prepinfo
@subheading Advanced Notes: Operator Evaluation Order
@c comma before precedence does NOT start tertiary
@cindex advanced features, operators, precedence
@cindex precedence
@cindex operators, precedence
@cindex portability, operators
@cindex evaluation order
@cindex Marx, Groucho
@quotation
@i{Doctor, doctor! It hurts when I do this!@*
So don't do that!}@*
Groucho Marx
@end quotation

@noindent
What happens for something like the following?

@example
b = 6
print b += b++
@end example

@noindent
Or something even stranger?

@example
b = 6
b += ++b + b++
print b
@end example

@cindex side effects
In other words, when do the various side effects prescribed by the
postfix operators (@samp{b++}) take effect?
When side effects happen is @dfn{implementation defined}.
In other words, it is up to the particular version of @command{awk}.
The result for the first example may be 12 or 13, and for the second, it
may be 22 or 23.

In short, doing things like this is not recommended and definitely
not anything that you can rely upon for portability.
You should avoid such things in your own programs.
@c You'll sleep better at night and be able to look at yourself
@c in the mirror in the morning.
@c ENDOFRANGE inop
@c ENDOFRANGE opde
@c ENDOFRANGE deop

@node Truth Values
@section True and False in @command{awk}
@cindex truth values
@cindex logical false/true
@cindex false, logical
@cindex true, logical

@cindex null strings
Many programming languages have a special representation for the concepts
of ``true'' and ``false.'' Such languages usually use the special
constants @code{true} and @code{false}, or perhaps their uppercase
equivalents.
However, @command{awk} is different.
It borrows a very simple concept of true and
false from C. In @command{awk}, any nonzero numeric value @emph{or} any
nonempty string value is true. Any other value (zero or the null
string @code{""}) is false. The following program prints @samp{A strange
truth value} three times:

@example
BEGIN @{
   if (3.1415927)
       print "A strange truth value"
   if ("Four Score And Seven Years Ago")
       print "A strange truth value"
   if (j = 57)
       print "A strange truth value"
@}
@end example

@cindex dark corner
There is a surprising consequence of the ``nonzero or non-null'' rule:
the string constant @code{"0"} is actually true, because it is non-null.
@value{DARKCORNER}

@node Typing and Comparison
@section Variable Typing and Comparison Expressions
@quotation
@i{The Guide is definitive. Reality is frequently inaccurate.}@*
The Hitchhiker's Guide to the Galaxy
@end quotation

@c STARTOFRANGE comex
@cindex comparison expressions
@c STARTOFRANGE excom
@cindex expressions, comparison
@cindex expressions, matching, See comparison expressions
@cindex matching, expressions, See comparison expressions
@cindex relational operators, See comparison operators
@c comma is part of See
@cindex operators, relational, See operators, comparison
@c STARTOFRANGE varting
@cindex variable typing
@c STARTOFRANGE vartypc
@cindex variables, types of, comparison expressions and
Unlike variables in many other programming languages, @command{awk}
variables do not have a fixed type. Instead, they can be either a number
or a string, depending upon the value that is assigned to them.
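
As a rough illustration of why the current type matters, the following
sketch assigns first a number and then a string constant to the same
variable and compares it each time. The names @code{v}, @code{r1}, and
@code{r2} are arbitrary, and the shell variable @code{out} exists only
to capture the output; none of them come from the original text.

```shell
# Illustrative only: v, r1, r2, and out are made-up names.
out=$(awk 'BEGIN {
    v = 10             # v holds a numeric value
    r1 = (v < 9)       # both operands numeric: 10 < 9 is false
    v = "10"           # the same variable now holds a string
    r2 = (v < "9")     # both operands strings: "10" sorts before "9"
    print r1, r2
}')
printf '%s\n' "$out"
```

This prints @samp{0 1}: the same digits compare differently depending on
whether the variable currently holds a number or a string. The rest of
this @value{SECTION} describes the rules that decide which kind of
comparison is used.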

@cindex numeric, strings
@cindex strings, numeric
@cindex POSIX @command{awk}, numeric strings and
The 1992 POSIX standard introduced
the concept of a @dfn{numeric string}, which is simply a string that looks
like a number---for example, @code{@w{" +2"}}. This concept is used
for determining the type of a variable.
The type of the variable is important because the types of two variables
determine how they are compared.
In @command{gawk}, variable typing follows these rules:

@itemize @bullet
@item
A numeric constant or the result of a numeric operation has the @var{numeric}
attribute.

@item
A string constant or the result of a string operation has the @var{string}
attribute.

@item
Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} elements,
@code{ENVIRON} elements, and the
elements of an array created by @code{split} that are numeric strings
have the @var{strnum} attribute. Otherwise, they have the @var{string}
attribute.
Uninitialized variables also have the @var{strnum} attribute.

@item
Attributes propagate across assignments but are not changed by
any use.
@c (Although a use may cause the entity to acquire an additional
@c value such that it has both a numeric and string value, this leaves the
@c attribute unchanged.)
@c This is important but not relevant
@end itemize

The last rule is particularly important. In the following program,
@code{a} has numeric type, even though it is later used in a string
operation:

@example
BEGIN @{
         a = 12.345
         b = a " is a cute number"
         print b
@}
@end example

When two operands are compared, either string comparison or numeric comparison
may be used.
This depends upon the attributes of the operands, according to the
following symmetric matrix:

@c thanks to Karl Berry, kb@cs.umb.edu, for major help with TeX tables
@tex
\centerline{
\vbox{\bigskip % space above the table (about 1 linespace)
% Because we have vertical rules, we can't let TeX insert interline space
% in its usual way.
\offinterlineskip
%
% Define the table template. & separates columns, and \cr ends the
% template (and each row). # is replaced by the text of that entry on
% each row. The template for the first column breaks down like this:
%   \strut -- a way to make each line have the height and depth
%             of a normal line of type, since we turned off interline spacing.
%   \hfil -- infinite glue; has the effect of right-justifying in this case.
%   # -- replaced by the text (for instance, `STRNUM', in the last row).
%   \quad -- about the width of an `M'. Just separates the columns.
%
% The second column (\vrule#) is what generates the vertical rule that
% spans table rows.
%
% The doubled && before the next entry means `repeat the following
% template as many times as necessary on each line' -- in our case, twice.
%
% The template itself, \quad#\hfil, left-justifies with a little space before.
%
\halign{\strut\hfil#\quad&\vrule#&&\quad#\hfil\cr
	&&STRING	&NUMERIC	&STRNUM\cr
% The \omit tells TeX to skip inserting the template for this column on
% this particular row. In this case, we only want a little extra space
% to separate the heading row from the rule below it. the depth 2pt --
% `\vrule depth 2pt' is that little space.
\omit	&depth 2pt\cr
% This is the horizontal rule below the heading. Since it has nothing to
% do with the columns of the table, we use \noalign to get it in there.
\noalign{\hrule}
% Like above, this time a little more space.
\omit	&depth 4pt\cr
% The remaining rows have nothing special about them.
STRING	&&string	&string	&string\cr
NUMERIC	&&string	&numeric	&numeric\cr
STRNUM	&&string	&numeric	&numeric\cr
}}}
@end tex
@ifnottex
@display
        +----------------------------------------------
        |       STRING          NUMERIC         STRNUM
--------+----------------------------------------------
        |
STRING  |       string          string          string
        |
NUMERIC |       string          numeric         numeric
        |
STRNUM  |       string          numeric         numeric
--------+----------------------------------------------
@end display
@end ifnottex

The basic idea is that user input that looks numeric---and @emph{only}
user input---should be treated as numeric, even though it is actually
made of characters and is therefore also a string.
Thus, for example, the string constant @w{@code{" +3.14"}}
is a string, even though it looks numeric,
and is @emph{never} treated as a number for comparison
purposes.

In short, when one operand is a ``pure'' string, such as a string
constant, then a string comparison is performed. Otherwise, a
numeric comparison is performed.@footnote{The POSIX standard is under
revision. The revised standard's rules for typing and comparison are
the same as just described for @command{gawk}.}

@dfn{Comparison expressions} compare strings or numbers for
relationships such as equality. They are written using @dfn{relational
operators}, which are a superset of those in C.
Here is a table of
them:

@cindex @code{<} (left angle bracket), @code{<} operator
@cindex left angle bracket (@code{<}), @code{<} operator
@cindex @code{<} (left angle bracket), @code{<=} operator
@cindex left angle bracket (@code{<}), @code{<=} operator
@cindex @code{>} (right angle bracket), @code{>=} operator
@cindex right angle bracket (@code{>}), @code{>=} operator
@cindex @code{>} (right angle bracket), @code{>} operator
@cindex right angle bracket (@code{>}), @code{>} operator
@cindex @code{=} (equals sign), @code{==} operator
@cindex equals sign (@code{=}), @code{==} operator
@cindex @code{!} (exclamation point), @code{!=} operator
@cindex exclamation point (@code{!}), @code{!=} operator
@cindex @code{~} (tilde), @code{~} operator
@cindex tilde (@code{~}), @code{~} operator
@cindex @code{!} (exclamation point), @code{!~} operator
@cindex exclamation point (@code{!}), @code{!~} operator
@cindex @code{in} operator
@table @code
@item @var{x} < @var{y}
True if @var{x} is less than @var{y}.

@item @var{x} <= @var{y}
True if @var{x} is less than or equal to @var{y}.

@item @var{x} > @var{y}
True if @var{x} is greater than @var{y}.

@item @var{x} >= @var{y}
True if @var{x} is greater than or equal to @var{y}.

@item @var{x} == @var{y}
True if @var{x} is equal to @var{y}.

@item @var{x} != @var{y}
True if @var{x} is not equal to @var{y}.

@item @var{x} ~ @var{y}
True if the string @var{x} matches the regexp denoted by @var{y}.

@item @var{x} !~ @var{y}
True if the string @var{x} does not match the regexp denoted by @var{y}.

@item @var{subscript} in @var{array}
True if the array @var{array} has an element with the subscript @var{subscript}.
@end table

Comparison expressions have the value one if true and zero if false.
When comparing operands of mixed types, numeric operands are converted
to strings using the value of @code{CONVFMT}
(@pxref{Conversion}).

Strings are compared
by comparing the first character of each, then the second character of each,
and so on. Thus, @code{"10"} is less than @code{"9"}. If there are two
strings where one is a prefix of the other, the shorter string is less than
the longer one. Thus, @code{"abc"} is less than @code{"abcd"}.

@cindex troubleshooting, @code{==} operator
It is very easy to accidentally mistype the @samp{==} operator and
leave off one of the @samp{=} characters. The result is still valid @command{awk}
code, but the program does not do what is intended:

@example
if (a = b)   # oops! should be a == b
   @dots{}
else
   @dots{}
@end example

@noindent
Unless @code{b} happens to be zero or the null string, the @code{if}
part of the test always succeeds. Because the operators are
so similar, this kind of error is very difficult to spot when
scanning the source code.

@cindex @command{gawk}, comparison operators and
The following table of expressions illustrates the kind of comparison
@command{gawk} performs, as well as what the result of the comparison is:

@table @code
@item 1.5 <= 2.0
numeric comparison (true)

@item "abc" >= "xyz"
string comparison (false)

@item 1.5 != " +2"
string comparison (true)

@item "1e2" < "3"
string comparison (true)

@item a = 2; b = "2"
@itemx a == b
string comparison (true)

@item a = 2; b = " +2"
@itemx a == b
string comparison (false)
@end table

In the next example:

@example
$ echo 1e2 3 | awk '@{ print ($1 < $2) ? "true" : "false" @}'
@print{} false
@end example

@cindex comparison expressions, string vs. regexp
@c @cindex string comparison vs.
@c regexp comparison
@c @cindex regexp comparison vs. string comparison
@noindent
the result is @samp{false} because both @code{$1} and @code{$2}
are user input. They are numeric strings---therefore both have
the @var{strnum} attribute, dictating a numeric comparison.
The purpose of the comparison rules and the use of numeric strings is
to attempt to produce the behavior that is ``least surprising,'' while
still ``doing the right thing.''

String comparisons and regular expression comparisons are very different.
For example:

@example
x == "foo"
@end example

@noindent
has the value one (i.e., it is true) if the variable @code{x}
is precisely @samp{foo}. By contrast:

@example
x ~ /foo/
@end example

@noindent
has the value one if @code{x} contains @samp{foo}, such as
@code{"Oh, what a fool am I!"}.

@cindex @code{~} (tilde), @code{~} operator
@cindex tilde (@code{~}), @code{~} operator
@cindex @code{!} (exclamation point), @code{!~} operator
@cindex exclamation point (@code{!}), @code{!~} operator
The righthand operand of the @samp{~} and @samp{!~} operators may be
either a regexp constant (@code{/@dots{}/}) or an ordinary
expression. In the latter case, the value of the expression as a string is used as a
dynamic regexp (@pxref{Regexp Usage}; also
@pxref{Computed Regexps}).

@cindex @command{awk}, regexp constants and
@cindex regexp constants
In modern implementations of @command{awk}, a constant regular
expression in slashes by itself is also an expression. The regexp
@code{/@var{regexp}/} is an abbreviation for the following comparison expression:

@example
$0 ~ /@var{regexp}/
@end example

One special place where @code{/foo/} is @emph{not} an abbreviation for
@samp{$0 ~ /foo/} is when it is the righthand operand of @samp{~} or
@samp{!~}.
@xref{Using Constant Regexps},
where this is discussed in more detail.
@c ENDOFRANGE comex
@c ENDOFRANGE excom
@c ENDOFRANGE vartypc
@c ENDOFRANGE varting

@node Boolean Ops
@section Boolean Expressions
@cindex and Boolean-logic operator
@cindex or Boolean-logic operator
@cindex not Boolean-logic operator
@c STARTOFRANGE exbo
@cindex expressions, Boolean
@c STARTOFRANGE boex
@cindex Boolean expressions
@cindex operators, Boolean, See Boolean expressions
@cindex Boolean operators, See Boolean expressions
@cindex logical operators, See Boolean expressions
@cindex operators, logical, See Boolean expressions

A @dfn{Boolean expression} is a combination of comparison expressions or
matching expressions, using the Boolean operators ``or''
(@samp{||}), ``and'' (@samp{&&}), and ``not'' (@samp{!}), along with
parentheses to control nesting. The truth value of the Boolean expression is
computed by combining the truth values of the component expressions.
Boolean expressions are also referred to as @dfn{logical expressions}.
The terms are equivalent.

Boolean expressions can be used wherever comparison and matching
expressions can be used. They can be used in @code{if}, @code{while},
@code{do}, and @code{for} statements
(@pxref{Statements}).
They have numeric values (one if true, zero if false) that come into play
if the result of the Boolean expression is stored in a variable or
used in arithmetic.

In addition, every Boolean expression is also a valid pattern, so
you can use one as a pattern to control the execution of rules.
The Boolean operators are:

@table @code
@item @var{boolean1} && @var{boolean2}
True if both @var{boolean1} and @var{boolean2} are true.
For example,
the following statement prints the current input record if it contains
both @samp{2400} and @samp{foo}:

@example
if ($0 ~ /2400/ && $0 ~ /foo/) print
@end example

@cindex side effects, Boolean operators
The subexpression @var{boolean2} is evaluated only if @var{boolean1}
is true. This can make a difference when @var{boolean2} contains
expressions that have side effects. In the case of @samp{$0 ~ /foo/ &&
($2 == bar++)}, the variable @code{bar} is not incremented if there is
no substring @samp{foo} in the record.

@item @var{boolean1} || @var{boolean2}
True if at least one of @var{boolean1} or @var{boolean2} is true.
For example, the following statement prints all records in the input
that contain @emph{either} @samp{2400} or
@samp{foo} or both:

@example
if ($0 ~ /2400/ || $0 ~ /foo/) print
@end example

The subexpression @var{boolean2} is evaluated only if @var{boolean1}
is false. This can make a difference when @var{boolean2} contains
expressions that have side effects.

@item ! @var{boolean}
True if @var{boolean} is false. For example,
the following program prints @samp{no home!} in
the unusual event that the @env{HOME} environment
variable is not defined:

@example
BEGIN @{ if (! ("HOME" in ENVIRON))
               print "no home!" @}
@end example

(The @code{in} operator is described in
@ref{Reference to Elements}.)
@end table

@cindex short-circuit operators
@cindex operators, short-circuit
@cindex @code{&} (ampersand), @code{&&} operator
@cindex ampersand (@code{&}), @code{&&} operator
@cindex @code{|} (vertical bar), @code{||} operator
@cindex vertical bar (@code{|}), @code{||} operator
The @samp{&&} and @samp{||} operators are called @dfn{short-circuit}
operators because of the way they work.
Evaluation of the full expression
is ``short-circuited'' if the result can be determined part way through
its evaluation.

@cindex line continuations
Statements that use @samp{&&} or @samp{||} can be continued simply
by putting a newline after them. But you cannot put a newline in front
of either of these operators without using backslash continuation
(@pxref{Statements/Lines}).

@cindex @code{!} (exclamation point), @code{!} operator
@cindex exclamation point (@code{!}), @code{!} operator
@cindex newlines
@cindex variables, flag
@cindex flag variables
The actual value of an expression using the @samp{!} operator is
either one or zero, depending upon the truth value of the expression it
is applied to.
The @samp{!} operator is often useful for changing the sense of a flag
variable from false to true and back again. For example, the following
program is one way to print lines in between special bracketing lines:

@example
$1 == "START"   @{ interested = ! interested; next @}
interested == 1 @{ print @}
$1 == "END"     @{ interested = ! interested; next @}
@end example

@noindent
The variable @code{interested}, as with all @command{awk} variables, starts
out initialized to zero, which is also false. When a line is seen whose
first field is @samp{START}, the value of @code{interested} is toggled
to true, using @samp{!}. The next rule prints lines as long as
@code{interested} is true. When a line is seen whose first field is
@samp{END}, @code{interested} is toggled back to false.

@ignore
Scott Deifik points out that this program isn't robust against
bogus input data, but the point is to illustrate the use of `!',
so we'll leave well enough alone.
@end ignore

@cindex @code{next} statement
@strong{Note:} The @code{next} statement is discussed in
@ref{Next Statement}.
@code{next} tells @command{awk} to skip the rest of the rules, get the
next record, and start processing the rules over again at the top.
The reason it's there is to avoid printing the bracketing
@samp{START} and @samp{END} lines.
@c ENDOFRANGE exbo
@c ENDOFRANGE boex

@node Conditional Exp
@section Conditional Expressions
@cindex conditional expressions
@cindex expressions, conditional
@cindex expressions, selecting

A @dfn{conditional expression} is a special kind of expression that has
three operands. It allows you to use one expression's value to select
one of two other expressions.
The conditional expression is the same as in the C language,
as shown here:

@example
@var{selector} ? @var{if-true-exp} : @var{if-false-exp}
@end example

@noindent
There are three subexpressions. The first, @var{selector}, is always
computed first. If it is ``true'' (not zero or not null), then
@var{if-true-exp} is computed next and its value becomes the value of
the whole expression. Otherwise, @var{if-false-exp} is computed next
and its value becomes the value of the whole expression.
For example, the following expression produces the absolute value of @code{x}:

@example
x >= 0 ? x : -x
@end example

@cindex side effects, conditional expressions
Each time the conditional expression is computed, only one of
@var{if-true-exp} and @var{if-false-exp} is used; the other is ignored.
This is important when the expressions have side effects. For example,
this conditional expression examines element @code{i} of either array
@code{a} or array @code{b}, and increments @code{i}:

@example
x == y ? a[i++] : b[i++]
@end example

@noindent
This is guaranteed to increment @code{i} exactly once, because each time
only one of the two increment expressions is executed
and the other is not.
@xref{Arrays},
for more information about arrays.

@cindex differences in @command{awk} and @command{gawk}, line continuations
@cindex line continuations, @command{gawk}
@cindex @command{gawk}, line continuation in
As a minor @command{gawk} extension,
a statement that uses @samp{?:} can be continued simply
by putting a newline after either character.
However, putting a newline in front
of either character does not work without using backslash continuation
(@pxref{Statements/Lines}).
If @option{--posix} is specified
(@pxref{Options}), then this extension is disabled.

@node Function Calls
@section Function Calls
@cindex function calls

A @dfn{function} is a name for a particular calculation.
This enables you to
ask for it by name at any point in the program. For
example, the function @code{sqrt} computes the square root of a number.

@cindex functions, built-in
A fixed set of functions are @dfn{built-in}, which means they are
available in every @command{awk} program. The @code{sqrt} function is one
of these. @xref{Built-in}, for a list of built-in
functions and their descriptions. In addition, you can define
functions for use in your program.
@xref{User-defined},
for instructions on how to do this.

@cindex arguments, in function calls
The way to use a function is with a @dfn{function call} expression,
which consists of the function name followed immediately by a list of
@dfn{arguments} in parentheses. The arguments are expressions that
provide the raw materials for the function's calculations.
When there is more than one argument, they are separated by commas. If
there are no arguments, just write @samp{()} after the function name.
The following examples show function calls with and without arguments:

@example
sqrt(x^2 + y^2)        @i{one argument}
atan2(y, x)            @i{two arguments}
rand()                 @i{no arguments}
@end example

@cindex troubleshooting, function call syntax
@strong{Caution:}
Do not put any space between the function name and the open-parenthesis!
A user-defined function name looks just like the name of a
variable---a space would make the expression look like concatenation of
a variable with an expression inside parentheses.

With built-in functions, space before the parenthesis is harmless, but
it is best not to get into the habit of using it; that way you avoid
mistakes with user-defined functions. Each function expects a particular number
of arguments. For example, the @code{sqrt} function must be called with
a single argument, the number whose square root is wanted:

@example
sqrt(@var{argument})
@end example

Some of the built-in functions have one or
more optional arguments.
If those arguments are not supplied, the functions
use a reasonable default value.
@xref{Built-in}, for full details. If arguments
are omitted in calls to user-defined functions, then those arguments are
treated as local variables and initialized to the empty string
(@pxref{User-defined}).

@cindex side effects, function calls
Like every other expression, the function call has a value, which is
computed by the function based on the arguments you give it. In this
example, the value of @samp{sqrt(@var{argument})} is the square root of
@var{argument}. A function can also have side effects, such as assigning
values to certain variables or doing I/O.
The following program reads numbers, one number per line, and prints the
square root of each one:

@example
$ awk '@{ print "The square root of", $1, "is", sqrt($1) @}'
1
@print{} The square root of 1 is 1
3
@print{} The square root of 3 is 1.73205
5
@print{} The square root of 5 is 2.23607
@kbd{@value{CTL}-d}
@end example

@node Precedence
@section Operator Precedence (How Operators Nest)
@c STARTOFRANGE prec
@cindex precedence
@c STARTOFRANGE oppr
@cindex operators, precedence

@dfn{Operator precedence} determines how operators are grouped when
different operators appear close by in one expression. For example,
@samp{*} has higher precedence than @samp{+}; thus, @samp{a + b * c}
means to multiply @code{b} and @code{c}, and then add @code{a} to the
product (i.e., @samp{a + (b * c)}).

The normal precedence of the operators can be overruled by using parentheses.
Think of the precedence rules as saying where the
parentheses are assumed to be. In
fact, it is wise to always use parentheses whenever there is an unusual
combination of operators, because other people who read the program may
not remember what the precedence is in this case.
Even experienced programmers occasionally forget the exact rules,
which leads to mistakes.
Explicit parentheses help prevent
any such mistakes.

When operators of equal precedence are used together, the leftmost
operator groups first, except for the assignment, conditional, and
exponentiation operators, which group in the opposite order.
Thus, @samp{a - b + c} groups as @samp{(a - b) + c} and
@samp{a = b = c} groups as @samp{a = (b = c)}.

The precedence of prefix unary operators does not matter as long as only
unary operators are involved, because there is only one way to interpret
them: innermost first.
Thus, @samp{$++i} means @samp{$(++i)} and
@samp{++$x} means @samp{++($x)}. However, when another operator follows
the operand, then the precedence of the unary operators can matter.
@samp{$x^2} means @samp{($x)^2}, but @samp{-x^2} means
@samp{-(x^2)}, because @samp{-} has lower precedence than @samp{^},
whereas @samp{$} has higher precedence.
This table presents @command{awk}'s operators, in order of highest
to lowest precedence:

@c use @code in the items, looks better in TeX w/o all the quotes
@table @code
@item (@dots{})
Grouping.

@cindex @code{$} (dollar sign), @code{$} field operator
@cindex dollar sign (@code{$}), @code{$} field operator
@item $
Field.

@cindex @code{+} (plus sign), @code{++} operator
@cindex plus sign (@code{+}), @code{++} operator
@cindex @code{-} (hyphen), @code{--} (decrement/increment) operator
@cindex hyphen (@code{-}), @code{--} (decrement/increment) operators
@item ++ --
Increment, decrement.

@cindex @code{^} (caret), @code{^} operator
@cindex caret (@code{^}), @code{^} operator
@cindex @code{*} (asterisk), @code{**} operator
@cindex asterisk (@code{*}), @code{**} operator
@item ^ **
Exponentiation. These operators group right-to-left.

@cindex @code{+} (plus sign), @code{+} operator
@cindex plus sign (@code{+}), @code{+} operator
@cindex @code{-} (hyphen), @code{-} operator
@cindex hyphen (@code{-}), @code{-} operator
@cindex @code{!} (exclamation point), @code{!} operator
@cindex exclamation point (@code{!}), @code{!} operator
@item + - !
Unary plus, minus, logical ``not.''

@cindex @code{*} (asterisk), @code{*} operator, as multiplication operator
@cindex asterisk (@code{*}), @code{*} operator, as multiplication operator
@cindex @code{/} (forward slash), @code{/} operator
@cindex forward slash (@code{/}), @code{/} operator
@cindex @code{%} (percent sign), @code{%} operator
@cindex percent sign (@code{%}), @code{%} operator
@item * / %
Multiplication, division, modulus.

@cindex @code{+} (plus sign), @code{+} operator
@cindex plus sign (@code{+}), @code{+} operator
@cindex @code{-} (hyphen), @code{-} operator
@cindex hyphen (@code{-}), @code{-} operator
@item + -
Addition, subtraction.

@item @r{String Concatenation}
No special symbol is used to indicate concatenation.
The operands are simply written side by side
(@pxref{Concatenation}).

@cindex @code{<} (left angle bracket), @code{<} operator
@cindex left angle bracket (@code{<}), @code{<} operator
@cindex @code{<} (left angle bracket), @code{<=} operator
@cindex left angle bracket (@code{<}), @code{<=} operator
@cindex @code{>} (right angle bracket), @code{>=} operator
@cindex right angle bracket (@code{>}), @code{>=} operator
@cindex @code{>} (right angle bracket), @code{>} operator
@cindex right angle bracket (@code{>}), @code{>} operator
@cindex @code{=} (equals sign), @code{==} operator
@cindex equals sign (@code{=}), @code{==} operator
@cindex @code{!} (exclamation point), @code{!=} operator
@cindex exclamation point (@code{!}), @code{!=} operator
@cindex @code{>} (right angle bracket), @code{>>} operator (I/O)
@cindex right angle bracket (@code{>}), @code{>>} operator (I/O)
@cindex operators, input/output
@cindex @code{|} (vertical bar), @code{|} operator (I/O)
@cindex vertical bar (@code{|}), @code{|} operator (I/O)
@cindex @code{|} (vertical bar), @code{|&} operator (I/O)
@cindex vertical bar (@code{|}), @code{|&} operator (I/O)
@item < <= == !=
@itemx > >= >> | |&
Relational and redirection.
The relational operators and the redirections have the same precedence
level. Characters such as @samp{>} serve both as relationals and as
redirections; the context distinguishes between the two meanings.

@cindex @code{print} statement, I/O operators in
@cindex @code{printf} statement, I/O operators in
Note that the I/O redirection operators in @code{print} and @code{printf}
statements belong to the statement level, not to expressions. The
redirection does not produce an expression that could be the operand of
another operator. As a result, it does not make sense to use a
redirection operator near another operator of lower precedence without
parentheses. Such combinations (for example, @samp{print foo > a ? b : c})
result in syntax errors.
The correct way to write this statement is @samp{print foo > (a ? b : c)}.

@cindex @code{~} (tilde), @code{~} operator
@cindex tilde (@code{~}), @code{~} operator
@cindex @code{!} (exclamation point), @code{!~} operator
@cindex exclamation point (@code{!}), @code{!~} operator
@item ~ !~
Matching, nonmatching.

@cindex @code{in} operator
@item in
Array membership.

@cindex @code{&} (ampersand), @code{&&} operator
@cindex ampersand (@code{&}), @code{&&} operator
@item &&
Logical ``and''.

@cindex @code{|} (vertical bar), @code{||} operator
@cindex vertical bar (@code{|}), @code{||} operator
@item ||
Logical ``or''.

@cindex @code{?} (question mark), @code{?:} operator
@cindex question mark (@code{?}), @code{?:} operator
@item ?:
Conditional. This operator groups right-to-left.

@cindex @code{+} (plus sign), @code{+=} operator
@cindex plus sign (@code{+}), @code{+=} operator
@cindex @code{-} (hyphen), @code{-=} operator
@cindex hyphen (@code{-}), @code{-=} operator
@cindex @code{*} (asterisk), @code{*=} operator
@cindex asterisk (@code{*}), @code{*=} operator
@cindex @code{*} (asterisk), @code{**=} operator
@cindex asterisk (@code{*}), @code{**=} operator
@cindex @code{/} (forward slash), @code{/=} operator
@cindex forward slash (@code{/}), @code{/=} operator
@cindex @code{%} (percent sign), @code{%=} operator
@cindex percent sign (@code{%}), @code{%=} operator
@cindex @code{^} (caret), @code{^=} operator
@cindex caret (@code{^}), @code{^=} operator
@item = += -= *=
@itemx /= %= ^= **=
Assignment. These operators group right-to-left.
@end table

@cindex portability, operators, not in POSIX @command{awk}
@strong{Note:}
The @samp{|&}, @samp{**}, and @samp{**=} operators are not specified by POSIX.
For maximum portability, do not use them.
@c ENDOFRANGE prec
@c ENDOFRANGE oppr
@c ENDOFRANGE exps

@node Patterns and Actions
@chapter Patterns, Actions, and Variables
@c STARTOFRANGE pat
@cindex patterns

As you have already seen, each @command{awk} rule consists of
a pattern with an associated action. This @value{CHAPTER} describes how
you build patterns and actions, what kinds of things you can do within
actions, and @command{awk}'s built-in variables.

The pattern-action rules and the statements available for use
within actions form the core of @command{awk} programming.
In a sense, everything covered
up to here has been the foundation
that programs are built on top of. Now it's time to start
building something useful.

@menu
* Pattern Overview::            What goes into a pattern.
* Using Shell Variables::       How to use shell variables with @command{awk}.
* Action Overview::             What goes into an action.
* Statements::                  Describes the various control statements in
                                detail.
* Built-in Variables::          Summarizes the built-in variables.
@end menu

@node Pattern Overview
@section Pattern Elements

@menu
* Regexp Patterns::             Using regexps as patterns.
* Expression Patterns::         Any expression can be used as a pattern.
* Ranges::                      Pairs of patterns specify record ranges.
* BEGIN/END::                   Specifying initialization and cleanup rules.
* Empty::                       The empty pattern, which matches every record.
@end menu

@cindex patterns, types of
Patterns in @command{awk} control the execution of rules---a rule is
executed when its pattern matches the current input record.
The following is a summary of the types of @command{awk} patterns:

@table @code
@item /@var{regular expression}/
A regular expression. It matches when the text of the
input record fits the regular expression.
(@xref{Regexp}.)

@item @var{expression}
A single expression. It matches when its value
is nonzero (if a number) or non-null (if a string).
(@xref{Expression Patterns}.)

@item @var{pat1}, @var{pat2}
A pair of patterns separated by a comma, specifying a range of records.
The range includes both the initial record that matches @var{pat1} and
the final record that matches @var{pat2}.
(@xref{Ranges}.)

@item BEGIN
@itemx END
Special patterns for you to supply startup or cleanup actions for your
@command{awk} program.
(@xref{BEGIN/END}.)

@item @var{empty}
The empty pattern matches every input record.
(@xref{Empty}.)
@end table

@node Regexp Patterns
@subsection Regular Expressions as Patterns
@cindex patterns, expressions as
@cindex regular expressions, as patterns

Regular expressions are one of the first kinds of patterns presented
in this book.
This kind of pattern is simply a regexp constant in the pattern part of
a rule. Its meaning is @samp{$0 ~ /@var{pattern}/}.
The pattern matches when the input record matches the regexp.
For example:

@example
/foo|bar|baz/  @{ buzzwords++ @}
END            @{ print buzzwords, "buzzwords seen" @}
@end example

@node Expression Patterns
@subsection Expressions as Patterns
@cindex expressions, as patterns

Any @command{awk} expression is valid as an @command{awk} pattern.
The pattern matches if the expression's value is nonzero (if a
number) or non-null (if a string).
The expression is reevaluated each time the rule is tested against a new
input record. If the expression uses fields such as @code{$1}, the
value depends directly on the new input record's text; otherwise, it
depends on only what has happened so far in the execution of the
@command{awk} program.

@cindex comparison expressions, as patterns
@cindex patterns, comparison expressions as
Comparison expressions, using the comparison operators described in
@ref{Typing and Comparison},
are a very common kind of pattern.
Regexp matching and nonmatching are also very common expressions.
The left operand of the @samp{~} and @samp{!~} operators is a string.
The right operand is either a constant regular expression enclosed in
slashes (@code{/@var{regexp}/}), or any expression whose string value
is used as a dynamic regular expression
(@pxref{Computed Regexps}).
The following example prints the second field of each input record
whose first field is precisely @samp{foo}:

@cindex @code{/} (forward slash), patterns and
@cindex forward slash (@code{/}), patterns and
@cindex @code{~} (tilde), @code{~} operator
@cindex tilde (@code{~}), @code{~} operator
@cindex @code{!} (exclamation point), @code{!~} operator
@cindex exclamation point (@code{!}), @code{!~} operator
@example
$ awk '$1 == "foo" @{ print $2 @}' BBS-list
@end example

@noindent
(There is no output, because there is no BBS site with the exact name @samp{foo}.)
Contrast this with the following regular expression match, which
accepts any record with a first field that contains @samp{foo}:

@example
$ awk '$1 ~ /foo/ @{ print $2 @}' BBS-list
@print{} 555-1234
@print{} 555-6699
@print{} 555-6480
@print{} 555-2127
@end example

@cindex regexp constants, as patterns
@cindex patterns, regexp constants as
A regexp constant as a pattern is also a special case of an expression
pattern. The expression @code{/foo/} has the value one if @samp{foo}
appears in the current input record. Thus, as a pattern, @code{/foo/}
matches any record containing @samp{foo}.

@cindex Boolean expressions, as patterns
Boolean expressions are also commonly used as patterns.
Whether the pattern
matches an input record depends on whether its subexpressions match.
For example, the following command prints all the records in
@file{BBS-list} that contain both @samp{2400} and @samp{foo}:

@example
$ awk '/2400/ && /foo/' BBS-list
@print{} fooey        555-1234     2400/1200/300     B
@end example

The following command prints all records in
@file{BBS-list} that contain @emph{either} @samp{2400} or @samp{foo}
(or both, of course):

@example
$ awk '/2400/ || /foo/' BBS-list
@print{} alpo-net     555-3412     2400/1200/300     A
@print{} bites        555-1675     2400/1200/300     A
@print{} fooey        555-1234     2400/1200/300     B
@print{} foot         555-6699     1200/300          B
@print{} macfoo       555-6480     1200/300          A
@print{} sdace        555-3430     2400/1200/300     A
@print{} sabafoo      555-2127     1200/300          C
@end example

The following command prints all records in
@file{BBS-list} that do @emph{not} contain the string @samp{foo}:

@example
$ awk '! /foo/' BBS-list
@print{} aardvark     555-5553     1200/300          B
@print{} alpo-net     555-3412     2400/1200/300     A
@print{} barfly       555-7685     1200/300          A
@print{} bites        555-1675     2400/1200/300     A
@print{} camelot      555-0542     300               C
@print{} core         555-2912     1200/300          C
@print{} sdace        555-3430     2400/1200/300     A
@end example

@cindex @code{BEGIN} pattern, Boolean patterns and
@cindex @code{END} pattern, Boolean patterns and
The subexpressions of a Boolean operator in a pattern can be constant regular
expressions, comparisons, or any other @command{awk} expressions. Range
patterns are not expressions, so they cannot appear inside Boolean
patterns. Likewise, the special patterns @code{BEGIN} and @code{END},
which never match any input record, are not expressions and cannot
appear inside Boolean patterns.
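
As a further sketch of mixing a comparison with a negated regexp in a
single Boolean pattern, the following command uses made-up input
supplied by @command{printf} (an assumption for illustration, not one
of this @value{DOCUMENT}'s sample files):

@example
$ printf 'a 1\nb 5\nfoo 7\n' | awk '$2 > 3 && ! /foo/ @{ print $1 @}'
@print{} b
@end example

@noindent
Only the second record passes both tests: the first fails the
comparison, and the third contains @samp{foo}.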

@node Ranges
@subsection Specifying Record Ranges with Patterns

@cindex range patterns
@cindex patterns, ranges in
@cindex lines, matching ranges of
@cindex @code{,} (comma), in range patterns
@cindex comma (@code{,}), in range patterns
A @dfn{range pattern} is made of two patterns separated by a comma, in
the form @samp{@var{begpat}, @var{endpat}}. It is used to match ranges of
consecutive input records. The first pattern, @var{begpat}, controls
where the range begins, while @var{endpat} controls where
the range ends. For example, the following:

@example
awk '$1 == "on", $1 == "off"' myfile
@end example

@noindent
prints every record in @file{myfile} between @samp{on}/@samp{off} pairs, inclusive.

A range pattern starts out by matching @var{begpat} against every
input record. When a record matches @var{begpat}, the range pattern is
@dfn{turned on} and the range pattern matches this record as well. As long as
the range pattern stays turned on, it automatically matches every input
record read. The range pattern also matches @var{endpat} against every
input record; when this succeeds, the range pattern is turned off again
for the following record. Then the range pattern goes back to checking
@var{begpat} against each record.

@c last comma does NOT start a tertiary
@cindex @code{if} statement, actions, changing
The record that turns on the range pattern and the one that turns it
off both match the range pattern. If you don't want to operate on
these records, you can write @code{if} statements in the rule's action
to distinguish them from the records you are interested in.

It is possible for a pattern to be turned on and off by the same
record. If the record satisfies both conditions, then the action is
executed for just that record.
For example, suppose there is text between two identical markers (e.g.,
the @samp{%} symbol), each on its own line, that should be ignored.
A first attempt would be to
combine a range pattern that describes the delimited text with the
@code{next} statement
(not discussed yet, @pxref{Next Statement}).
This causes @command{awk} to skip any further processing of the current
record and start over again with the next input record. Such a program
looks like this:

@example
/^%$/,/^%$/    @{ next @}
               @{ print @}
@end example

@noindent
@cindex lines, skipping between markers
@c @cindex flag variables
This program fails because the range pattern is both turned on and turned off
by the first line, which just has a @samp{%} on it. To accomplish this task,
write the program in the following manner, using a flag:

@cindex @code{!} operator
@example
/^%$/     @{ skip = ! skip; next @}
skip == 1 @{ next @} # skip lines with `skip' set
          @{ print @} # print every remaining line
@end example

In a range pattern, the comma (@samp{,}) has the lowest precedence of
all the operators (i.e., it is evaluated last). Thus, the following
program attempts to combine a range pattern with another, simpler test:

@example
echo Yes | awk '/1/,/2/ || /Yes/'
@end example

The intent of this program is @samp{(/1/,/2/) || /Yes/}.
However, @command{awk} interprets this as @samp{/1/, (/2/ || /Yes/)}.
This cannot be changed or worked around; range patterns do not combine
with other patterns:

@example
$ echo Yes | gawk '(/1/,/2/) || /Yes/'
@error{} gawk: cmd. line:1: (/1/,/2/) || /Yes/
@error{} gawk: cmd. line:1: ^ parse error
@error{} gawk: cmd. line:2: (/1/,/2/) || /Yes/
@error{} gawk: cmd. line:2: ^ unexpected newline
@end example

@node BEGIN/END
@subsection The @code{BEGIN} and @code{END} Special Patterns

@c STARTOFRANGE beg
@cindex @code{BEGIN} pattern
@c STARTOFRANGE end
@cindex @code{END} pattern
All the patterns described so far are for matching input records.
The @code{BEGIN} and @code{END} special patterns are different.
They supply startup and cleanup actions for @command{awk} programs.
@code{BEGIN} and @code{END} rules must have actions; there is no default
action for these rules because there is no current record when they run.
@code{BEGIN} and @code{END} rules are often referred to as
``@code{BEGIN} and @code{END} blocks'' by long-time @command{awk}
programmers.

@menu
* Using BEGIN/END::             How and why to use BEGIN/END rules.
* I/O And BEGIN/END::           I/O issues in BEGIN/END rules.
@end menu

@node Using BEGIN/END
@subsubsection Startup and Cleanup Actions

A @code{BEGIN} rule is executed once only, before the first input record
is read. Likewise, an @code{END} rule is executed once only, after all the
input is read. For example:

@example
$ awk '
> BEGIN @{ print "Analysis of \"foo\"" @}
> /foo/ @{ ++n @}
> END   @{ print "\"foo\" appears", n, "times." @}' BBS-list
@print{} Analysis of "foo"
@print{} "foo" appears 4 times.
@end example

@cindex @code{BEGIN} pattern, operators and
@cindex @code{END} pattern, operators and
This program finds the number of records in the input file @file{BBS-list}
that contain the string @samp{foo}. The @code{BEGIN} rule prints a title
for the report. There is no need to use the @code{BEGIN} rule to
initialize the counter @code{n} to zero, since @command{awk} does this
automatically (@pxref{Variables}).
The second rule increments the variable @code{n} every time a
record containing the pattern @samp{foo} is read.
The @code{END} rule
prints the value of @code{n} at the end of the run.

The special patterns @code{BEGIN} and @code{END} cannot be used in ranges
or with Boolean operators (indeed, they cannot be used with any operators).
An @command{awk} program may have multiple @code{BEGIN} and/or @code{END}
rules. They are executed in the order in which they appear: all the @code{BEGIN}
rules at startup and all the @code{END} rules at termination.
@code{BEGIN} and @code{END} rules may be intermixed with other rules.
This feature was added in the 1987 version of @command{awk} and is included
in the POSIX standard.
The original (1978) version of @command{awk}
required the @code{BEGIN} rule to be placed at the beginning of the
program, the @code{END} rule to be placed at the end, and only allowed one of
each.
This is no longer required, but it is a good idea to follow this template
in terms of program organization and readability.

Multiple @code{BEGIN} and @code{END} rules are useful for writing
library functions, because each library file can have its own @code{BEGIN} and/or
@code{END} rule to do its own initialization and/or cleanup.
The order in which library functions are named on the command line
controls the order in which their @code{BEGIN} and @code{END} rules are
executed. Therefore, you have to be careful when writing such rules in
library files so that the order in which they are executed doesn't matter.
@xref{Options}, for more information on
using library functions.
@xref{Library Functions},
for a number of useful library functions.
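
The ordering rule can be seen in a small sketch. The messages are
arbitrary, and @file{/dev/null} serves only as an empty input so that
the @code{END} rule runs right away:

@example
$ awk 'BEGIN @{ print "first" @}
>      END   @{ print "last" @}
>      BEGIN @{ print "second" @}' /dev/null
@print{} first
@print{} second
@print{} last
@end example

@noindent
Both @code{BEGIN} rules run, in program order, before any input is
read, even though an @code{END} rule sits between them in the source.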

If an @command{awk} program has only a @code{BEGIN} rule and no
other rules, then the program exits after the @code{BEGIN} rule is
run.@footnote{The original version of @command{awk} used to keep
reading and ignoring input until the end of the file was seen.} However, if an
@code{END} rule exists, then the input is read, even if there are
no other rules in the program. This is necessary in case the @code{END}
rule checks the @code{FNR} and @code{NR} variables.

@node I/O And BEGIN/END
@subsubsection Input/Output from @code{BEGIN} and @code{END} Rules

@cindex input/output, from @code{BEGIN} and @code{END}
There are several (sometimes subtle) points to remember when doing I/O
from a @code{BEGIN} or @code{END} rule.
The first has to do with the value of @code{$0} in a @code{BEGIN}
rule. Because @code{BEGIN} rules are executed before any input is read,
there simply is no input record, and therefore no fields, when
executing @code{BEGIN} rules. References to @code{$0} and the fields
yield a null string or zero, depending upon the context. One way
to give @code{$0} a real value is to execute a @code{getline} command
without a variable (@pxref{Getline}).
Another way is simply to assign a value to @code{$0}.

@cindex differences in @command{awk} and @command{gawk}, @code{BEGIN}/@code{END} patterns
@cindex POSIX @command{awk}, @code{BEGIN}/@code{END} patterns
@cindex @code{print} statement, @code{BEGIN}/@code{END} patterns and
@cindex @code{BEGIN} pattern, @code{print} statement and
@cindex @code{END} pattern, @code{print} statement and
The second point is similar to the first but from the other direction.
Traditionally, due largely to implementation issues, @code{$0} and
@code{NF} were @emph{undefined} inside an @code{END} rule.
The POSIX standard specifies that @code{NF} is available in an @code{END}
rule.
It contains the number of fields from the last input record.
Most probably due to an oversight, the standard does not say that @code{$0}
is also preserved, although logically one would think that it should be.
In fact, @command{gawk} does preserve the value of @code{$0} for use in
@code{END} rules. Be aware, however, that Unix @command{awk}, and possibly
other implementations, do not.

The third point follows from the first two. The meaning of @samp{print}
inside a @code{BEGIN} or @code{END} rule is the same as always:
@samp{print $0}. If @code{$0} is the null string, then this prints an
empty line. Many long-time @command{awk} programmers use an unadorned
@samp{print} in @code{BEGIN} and @code{END} rules, to mean @samp{@w{print ""}},
relying on @code{$0} being null. Although one might generally get away with
this in @code{BEGIN} rules, it is a very bad idea in @code{END} rules,
at least in @command{gawk}. It is also poor style, since if an empty
line is needed in the output, the program should print one explicitly.

@cindex @code{next} statement, @code{BEGIN}/@code{END} patterns and
@cindex @code{nextfile} statement, @code{BEGIN}/@code{END} patterns and
@cindex @code{BEGIN} pattern, @code{next}/@code{nextfile} statements and
@cindex @code{END} pattern, @code{next}/@code{nextfile} statements and
Finally, the @code{next} and @code{nextfile} statements are not allowed
in a @code{BEGIN} rule, because the implicit
read-a-record-and-match-against-the-rules loop has not started yet. Similarly, those statements
are not valid in an @code{END} rule, since all the input has been read.
(@xref{Next Statement}, and see
@ref{Nextfile Statement}.)
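
The following sketch illustrates the second and third points. It
assumes @command{gawk}; as noted above, other implementations may not
preserve @code{$0} in an @code{END} rule. The input is made up for
illustration:

@example
$ printf 'a b c\nd e\n' | gawk 'END @{ print NF ": " $0 @}'
@print{} 2: d e
@end example

@noindent
When an empty line is wanted, it is clearer to write @samp{@w{print ""}}
than to rely on a bare @samp{print}.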
@c ENDOFRANGE beg
@c ENDOFRANGE end

@node Empty
@subsection The Empty Pattern

@cindex empty pattern
@cindex patterns, empty
An empty (i.e., nonexistent) pattern is considered to match @emph{every}
input record. For example, the program:

@example
awk '@{ print $1 @}' BBS-list
@end example

@noindent
prints the first field of every record.
@c ENDOFRANGE pat

@node Using Shell Variables
@section Using Shell Variables in Programs
@cindex shells, variables
@cindex @command{awk} programs, shell variables in
@c @cindex shell and @command{awk} interaction

@command{awk} programs are often used as components in larger
programs written in shell.
For example, it is very common to use a shell variable to
hold a pattern that the @command{awk} program searches for.
There are two ways to get the value of the shell variable
into the body of the @command{awk} program.

@cindex shells, quoting
The most common method is to use shell quoting to substitute
the variable's value into the program inside the script.
For example, in the following program:

@example
echo -n "Enter search pattern: "
read pattern
awk "/$pattern/ "'@{ nmatches++ @}
     END @{ print nmatches, "found" @}' /path/to/data
@end example

@noindent
the @command{awk} program consists of two pieces of quoted text
that are concatenated together to form the program.
The first part is double-quoted, which allows substitution of
the @code{pattern} variable inside the quotes.
The second part is single-quoted.

Variable substitution via quoting works, but can be
messy. It requires a good understanding of the shell's quoting rules
(@pxref{Quoting}),
and it's often difficult to correctly
match up the quotes when reading the program.
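
One way to see the difficulty, sketched here with a hypothetical
pattern value: if the user types text containing a slash, the
substituted text no longer forms a single regexp constant, and
@command{awk} is likely to reject the program with a syntax error:

@example
pattern='usr/bin'      # hypothetical user input
awk "/$pattern/ "'@{ nmatches++ @}
     END @{ print nmatches, "found" @}' /path/to/data
@end example

@noindent
Here @command{awk} sees @samp{/usr/bin/ @{ nmatches++ @}} as its first
rule, in which the regexp constant ends at the second slash.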

A better method is to use @command{awk}'s variable assignment feature
(@pxref{Assignment Options})
to assign the shell variable's value to an @command{awk} variable.
Then use dynamic regexps to match the pattern
(@pxref{Computed Regexps}).
The following shows how to redo the
previous example using this technique:

@example
echo -n "Enter search pattern: "
read pattern
awk -v pat="$pattern" '$0 ~ pat @{ nmatches++ @}
       END @{ print nmatches, "found" @}' /path/to/data
@end example

@noindent
Now, the @command{awk} program is just one single-quoted string.
The assignment @samp{-v pat="$pattern"} still requires double quotes,
in case there is whitespace in the value of @code{$pattern}.
The @command{awk} variable @code{pat} could be named @code{pattern}
too, but that would be more confusing. Using a variable also
provides more flexibility, since the variable can be used anywhere inside
the program---for printing, as an array subscript, or for any other
use---without requiring the quoting tricks at every point in the program.

@node Action Overview
@section Actions
@c @cindex action, definition of
@c @cindex curly braces
@c @cindex action, curly braces
@c @cindex action, separating statements
@cindex actions

An @command{awk} program or script consists of a series of
rules and function definitions interspersed. (Functions are
described later. @xref{User-defined}.)
A rule contains a pattern and an action, either of which (but not
both) may be omitted. The purpose of the @dfn{action} is to tell
@command{awk} what to do once a match for the pattern is found.
Thus,
in outline, an @command{awk} program generally looks like this:

@example
@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
@dots{}
function @var{name}(@var{args}) @{ @dots{} @}
@dots{}
@end example

@cindex @code{@{@}} (braces), actions and
@cindex braces (@code{@{@}}), actions and
@cindex separators, for statements in actions
@cindex newlines, separating statements in actions
@cindex @code{;} (semicolon), separating statements in actions
@cindex semicolon (@code{;}), separating statements in actions
An action consists of one or more @command{awk} @dfn{statements}, enclosed
in curly braces (@samp{@{@dots{}@}}). Each statement specifies one
thing to do. The statements are separated by newlines or semicolons.
The curly braces around an action must be used even if the action
contains only one statement, or if it contains no statements at
all. However, if you omit the action entirely, omit the curly braces as
well. An omitted action is equivalent to @samp{@{ print $0 @}}:

@example
/foo/  @{ @}     @i{match @code{foo}, do nothing --- empty action}
/foo/          @i{match @code{foo}, print the record --- omitted action}
@end example

The following types of statements are supported in @command{awk}:

@table @asis
@cindex side effects, statements
@item Expressions
Call functions or assign values to variables
(@pxref{Expressions}). Executing
this kind of statement simply computes the value of the expression.
This is useful when the expression has side effects
(@pxref{Assignment Ops}).

@item Control statements
Specify the control flow of @command{awk}
programs. The @command{awk} language gives you C-like constructs
(@code{if}, @code{for}, @code{while}, and @code{do}) as well as a few
special ones (@pxref{Statements}).

@item Compound statements
Consist of one or more statements enclosed in
curly braces. A compound statement is used in order to put several
statements together in the body of an @code{if}, @code{while}, @code{do},
or @code{for} statement.

@item Input statements
Use the @code{getline} command
(@pxref{Getline}).
Also supplied in @command{awk} are the @code{next}
statement (@pxref{Next Statement}),
and the @code{nextfile} statement
(@pxref{Nextfile Statement}).

@item Output statements
Such as @code{print} and @code{printf}.
@xref{Printing}.

@item Deletion statements
For deleting array elements.
@xref{Delete}.
@end table

@node Statements
@section Control Statements in Actions
@c STARTOFRANGE csta
@cindex control statements
@c STARTOFRANGE acs
@cindex statements, control, in actions
@c STARTOFRANGE accs
@cindex actions, control statements in

@dfn{Control statements}, such as @code{if}, @code{while}, and so on,
control the flow of execution in @command{awk} programs. Most of the
control statements in @command{awk} are patterned on similar statements in C.

@c the comma here does NOT start a secondary
@cindex compound statements, control statements and
@c the second comma here does NOT start a tertiary
@cindex statements, compound, control statements and
@cindex body, in actions
@cindex @code{@{@}} (braces), statements, grouping
@cindex braces (@code{@{@}}), statements, grouping
@cindex newlines, separating statements in actions
@cindex @code{;} (semicolon), separating statements in actions
@cindex semicolon (@code{;}), separating statements in actions
All the control statements start with special keywords, such as @code{if}
and @code{while}, to distinguish them from simple expressions.
Many control statements contain other statements.
For example, the
@code{if} statement contains another statement that may or may not be
executed. The contained statement is called the @dfn{body}.
To include more than one statement in the body, group them into a
single @dfn{compound statement} with curly braces, separating them with
newlines or semicolons.

@menu
* If Statement::                Conditionally execute some @command{awk}
                                statements.
* While Statement::             Loop until some condition is satisfied.
* Do Statement::                Do specified action while looping until some
                                condition is satisfied.
* For Statement::               Another looping statement, that provides
                                initialization and increment clauses.
* Switch Statement::            Switch/case evaluation for conditional
                                execution of statements based on a value.
* Break Statement::             Immediately exit the innermost enclosing loop.
* Continue Statement::          Skip to the end of the innermost enclosing
                                loop.
* Next Statement::              Stop processing the current input record.
* Nextfile Statement::          Stop processing the current file.
* Exit Statement::              Stop execution of @command{awk}.
@end menu

@node If Statement
@subsection The @code{if}-@code{else} Statement

@cindex @code{if} statement
The @code{if}-@code{else} statement is @command{awk}'s decision-making
statement. It looks like this:

@example
if (@var{condition}) @var{then-body} @r{[}else @var{else-body}@r{]}
@end example

@noindent
The @var{condition} is an expression that controls what the rest of the
statement does. If the @var{condition} is true, @var{then-body} is
executed; otherwise, @var{else-body} is executed.
The @code{else} part of the statement is
optional. The condition is considered false if its value is zero or
the null string; otherwise, the condition is true.
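
The truth rule just stated can be checked directly. The following is a
minimal sketch, assuming a POSIX-style @command{awk} is available on the
search path; the names @code{truth_demo} and @code{truth} are made up
for the illustration.

```shell
# Hypothetical demo: report which values awk treats as true when they
# are used as the condition of an if statement.
truth_demo() {
    awk '
    # Return the string "true" or "false" for the value v.
    function truth(v) {
        if (v)
            return "true"
        return "false"
    }
    BEGIN {
        # numeric zero, null string, nonzero number, non-null string
        print truth(0), truth(""), truth(1), truth("a")
    }'
}

truth_demo    # prints: false false true true
```

Only the numeric zero and the null string select the false branch; the
other two values count as true.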
The following example tests whether @code{x} is even:

@example
if (x % 2 == 0)
    print "x is even"
else
    print "x is odd"
@end example

In this example, if the expression @samp{x % 2 == 0} is true (that is,
if the value of @code{x} is evenly divisible by two), then the first
@code{print} statement is executed; otherwise, the second @code{print}
statement is executed.
If the @code{else} keyword appears on the same line as @var{then-body} and
@var{then-body} is not a compound statement (i.e., not surrounded by
curly braces), then a semicolon must separate @var{then-body} from
the @code{else}.
To illustrate this, the previous example can be rewritten as:

@example
if (x % 2 == 0) print "x is even"; else
        print "x is odd"
@end example

@noindent
If the @samp{;} is left out, @command{awk} can't interpret the statement and
it produces a syntax error. Don't actually write programs this way,
because a human reader might fail to see the @code{else} if it is not
the first thing on its line.

@node While Statement
@subsection The @code{while} Statement
@cindex @code{while} statement
@cindex loops
@cindex loops, See Also @code{while} statement

In programming, a @dfn{loop} is a part of a program that can
be executed two or more times in succession.
The @code{while} statement is the simplest looping statement in
@command{awk}. It repeatedly executes a statement as long as a condition is
true. It looks like this:

@example
while (@var{condition})
  @var{body}
@end example

@cindex body, in loops
@noindent
@var{body} is a statement called the @dfn{body} of the loop,
and @var{condition} is an expression that controls how long the loop
keeps running.
The first thing the @code{while} statement does is test the @var{condition}.
If the @var{condition} is true, it executes the statement @var{body}.
@ifinfo
(The @var{condition} is true when the value
is not zero and not a null string.)
@end ifinfo
After @var{body} has been executed,
@var{condition} is tested again, and if it is still true, @var{body} is
executed again. This process repeats until the @var{condition} is no longer
true. If the @var{condition} is initially false, the body of the loop is
never executed and @command{awk} continues with the statement following
the loop.
This example prints the first three fields of each record, one per line:

@example
awk '@{ i = 1
       while (i <= 3) @{
           print $i
           i++
       @}
@}' inventory-shipped
@end example

@noindent
The body of this loop is a compound statement enclosed in braces,
containing two statements.
The loop works in the following manner: first, the value of @code{i} is set to one.
Then, the @code{while} statement tests whether @code{i} is less than or equal to
three. This is true when @code{i} equals one, so the @code{i}-th
field is printed. Then the @samp{i++} increments the value of @code{i}
and the loop repeats. The loop terminates when @code{i} reaches four.

A newline is not required between the condition and the
body; however, using one makes the program clearer unless the body is a
compound statement or else is very simple. The newline after the open-brace
that begins the compound statement is not required either, but the
program is harder to read without it.

@node Do Statement
@subsection The @code{do}-@code{while} Statement
@cindex @code{do}-@code{while} statement

The @code{do} loop is a variation of the @code{while} looping statement.
The @code{do} loop executes the @var{body} once and then repeats the
@var{body} as long as the @var{condition} is true.
It looks like this:

@example
do
  @var{body}
while (@var{condition})
@end example

Even if the @var{condition} is false at the start, the @var{body} is
executed at least once (and only once, unless executing @var{body}
makes @var{condition} true). Contrast this with the corresponding
@code{while} statement:

@example
while (@var{condition})
  @var{body}
@end example

@noindent
This statement does not execute @var{body} even once if the @var{condition}
is false to begin with.
The following is an example of a @code{do} statement:

@example
@{ i = 1
  do @{
      print $0
      i++
  @} while (i <= 10)
@}
@end example

@noindent
This program prints each input record 10 times. However, it isn't a very
realistic example, since in this case an ordinary @code{while} would do
just as well. This situation reflects actual experience; only
occasionally is there a real use for a @code{do} statement.

@node For Statement
@subsection The @code{for} Statement
@cindex @code{for} statement

The @code{for} statement makes it more convenient to count iterations of a
loop. The general form of the @code{for} statement looks like this:

@example
for (@var{initialization}; @var{condition}; @var{increment})
  @var{body}
@end example

@noindent
The @var{initialization}, @var{condition}, and @var{increment} parts are
arbitrary @command{awk} expressions, and @var{body} stands for any
@command{awk} statement.

The @code{for} statement starts by executing @var{initialization}.
Then, as long
as the @var{condition} is true, it repeatedly executes @var{body} and then
@var{increment}.
Typically, @var{initialization} sets a variable to
either zero or one, @var{increment} adds one to it, and @var{condition}
compares it against the desired number of iterations.
For example:

@example
awk '@{ for (i = 1; i <= 3; i++)
          print $i
@}' inventory-shipped
@end example

@noindent
This prints the first three fields of each input record, with one field per
line.

It isn't possible to
set more than one variable in the
@var{initialization} part without using a multiple assignment statement
such as @samp{x = y = 0}. This makes sense only if all the initial values
are equal. (But it is possible to initialize additional variables by writing
their assignments as separate statements preceding the @code{for} loop.)

@c @cindex comma operator, not supported
The same is true of the @var{increment} part. Incrementing additional
variables requires separate statements at the end of the loop.
The C compound expression, using C's comma operator, is useful in
this context but it is not supported in @command{awk}.

Most often, @var{increment} is an increment expression, as in the previous
example. But this is not required; it can be any expression
whatsoever. For example, the following statement prints all the powers of two
between 1 and 100:

@example
for (i = 1; i <= 100; i *= 2)
  print i
@end example

If there is nothing to be done, any of the three expressions in the
parentheses following the @code{for} keyword may be omitted. Thus,
@w{@samp{for (; x > 0;)}} is equivalent to @w{@samp{while (x > 0)}}. If the
@var{condition} is omitted, it is treated as true, effectively
yielding an @dfn{infinite loop} (i.e., a loop that never terminates).
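
As a small illustration of leaving parts of the @code{for} header empty,
the sketch below counts down with only a @var{condition} between the
parentheses, so it behaves like the equivalent @code{while} loop. The
function name @code{countdown} and the starting value are assumptions
chosen for the example.

```shell
# Hypothetical demo: a for loop with empty initialization and increment.
countdown() {
    awk -v x="$1" 'BEGIN {
        # x is set from the command line with -v, and the decrement
        # happens in the body, so only the condition remains here.
        for (; x > 0;)
            printf "%d ", x--
        print ""
    }'
}

countdown 3    # prints: 3 2 1
```

Because the body changes @code{x}, the @var{condition} eventually
becomes false and the loop ends; with an empty @var{condition} as well,
the body would need some other way to stop the loop.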

In most cases, a @code{for} loop is an abbreviation for a @code{while}
loop, as shown here:

@example
@var{initialization}
while (@var{condition}) @{
  @var{body}
  @var{increment}
@}
@end example

@cindex loops, @code{continue} statements and
@noindent
The only exception is when the @code{continue} statement
(@pxref{Continue Statement}) is used
inside the loop. Changing a @code{for} statement to a @code{while}
statement in this way can change the effect of the @code{continue}
statement inside the loop.

The @command{awk} language has a @code{for} statement in addition to a
@code{while} statement because a @code{for} loop is often both less work to
type and more natural to think of. Counting the number of iterations is
very common in loops. It can be easier to think of this counting as part
of looping rather than as something to do inside the loop.

@ifinfo
@cindex @code{in} operator
There is an alternate version of the @code{for} loop, for iterating over
all the indices of an array:

@example
for (i in array)
    @var{do something with} array[i]
@end example

@noindent
@xref{Scanning an Array},
for more information on this version of the @code{for} loop.
@end ifinfo

@node Switch Statement
@subsection The @code{switch} Statement
@cindex @code{switch} statement
@cindex @code{case} keyword
@cindex @code{default} keyword

@strong{NOTE:} This @value{SUBSECTION} describes an experimental feature
added in @command{gawk} 3.1.3. It is @emph{not} enabled by default. To
enable it, use the @option{--enable-switch} option to @command{configure}
when @command{gawk} is being configured and built.
@xref{Additional Configuration Options},
for more information.

The @code{switch} statement allows the evaluation of an expression and
the execution of statements based on a @code{case} match. Case statements
are checked for a match in the order they are defined. If no suitable
@code{case} is found, the @code{default} section is executed, if supplied. The
general form of the @code{switch} statement looks like this:

@example
switch (@var{expression}) @{
case @var{value or regular expression}:
    @var{case-body}
default:
    @var{default-body}
@}
@end example

The @code{switch} statement works as it does in C. Once a match to a given
case is made, case statement bodies are executed until a @code{break},
@code{continue}, @code{next}, @code{nextfile} or @code{exit} is encountered,
or the end of the @code{switch} statement itself. For example:

@example
switch (NR * 2 + 1) @{
case 3:
case "11":
    print NR - 1
    break

case /2[[:digit:]]+/:
    print NR

default:
    print NR + 1

case -1:
    print NR * -1
@}
@end example

Note that if none of the statements specified above halt execution
of a matched @code{case} statement, execution falls through to the
next @code{case} until execution halts. In the above example, for
any case value starting with @samp{2} followed by one or more digits,
the @code{print} statement is executed and then falls through into the
@code{default} section, executing its @code{print} statement. In turn,
the @minus{}1 case will also be executed since the @code{default} does
not halt execution.

@node Break Statement
@subsection The @code{break} Statement
@cindex @code{break} statement
@cindex loops, exiting

The @code{break} statement jumps out of the innermost @code{for},
@code{while}, or @code{do} loop that encloses it.
The following example
finds the smallest divisor of any integer, and also identifies prime
numbers:

@example
# find smallest divisor of num
@{
    num = $1
    for (div = 2; div*div <= num; div++)
        if (num % div == 0)
            break
    if (num % div == 0)
        printf "Smallest divisor of %d is %d\n", num, div
    else
        printf "%d is prime\n", num
@}
@end example

When the remainder is zero in the first @code{if} statement, @command{awk}
immediately @dfn{breaks out} of the containing @code{for} loop. This means
that @command{awk} proceeds immediately to the statement following the loop
and continues processing. (This is very different from the @code{exit}
statement, which stops the entire @command{awk} program.
@xref{Exit Statement}.)

The following program illustrates how the @var{condition} of a @code{for}
or @code{while} statement could be replaced with a @code{break} inside
an @code{if}:

@example
# find smallest divisor of num
@{
    num = $1
    for (div = 2; ; div++) @{
        if (num % div == 0) @{
            printf "Smallest divisor of %d is %d\n", num, div
            break
        @}
        if (div*div > num) @{
            printf "%d is prime\n", num
            break
        @}
    @}
@}
@end example

@c @cindex @code{break}, outside of loops
@c @cindex historical features
@c @cindex @command{awk} language, POSIX version
@cindex POSIX @command{awk}, @code{break} statement and
@cindex dark corner, @code{break} statement
@cindex @command{gawk}, @code{break} statement in
The @code{break} statement has no meaning when
used outside the body of a loop. However, although it was never documented,
historical implementations of @command{awk} treated the @code{break}
statement outside of a loop as if it were a @code{next} statement
(@pxref{Next Statement}).
Recent versions of Unix @command{awk} no longer allow this usage.
@command{gawk} supports this use of @code{break} only
if @option{--traditional}
has been specified on the command line
(@pxref{Options}).
Otherwise, it is treated as an error, since the POSIX standard
specifies that @code{break} should only be used inside the body of a
loop.
@value{DARKCORNER}

@node Continue Statement
@subsection The @code{continue} Statement

@cindex @code{continue} statement
As with @code{break}, the @code{continue} statement is used only inside
@code{for}, @code{while}, and @code{do} loops. It skips
over the rest of the loop body, causing the next cycle around the loop
to begin immediately. Contrast this with @code{break}, which jumps out
of the loop altogether.

The @code{continue} statement in a @code{for} loop directs @command{awk} to
skip the rest of the body of the loop and resume execution with the
increment-expression of the @code{for} statement. The following program
illustrates this fact:

@example
BEGIN @{
     for (x = 0; x <= 20; x++) @{
         if (x == 5)
             continue
         printf "%d ", x
     @}
     print ""
@}
@end example

@noindent
This program prints all the numbers from 0 to 20---except for 5, for
which the @code{printf} is skipped. Because the increment @samp{x++}
is not skipped, @code{x} does not remain stuck at 5. Contrast the
@code{for} loop from the previous example with the following @code{while} loop:

@example
BEGIN @{
     x = 0
     while (x <= 20) @{
         if (x == 5)
             continue
         printf "%d ", x
         x++
     @}
     print ""
@}
@end example

@noindent
This program loops forever once @code{x} reaches 5.
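
One way to keep the @code{while} form from stalling is to advance
@code{x} before the @code{continue}. The following is a sketch of that
idea rather than the only possible fix; the function name
@code{skip_five} is an assumption made for the example.

```shell
# Hypothetical demo: a while loop that skips 5 without looping forever.
skip_five() {
    awk 'BEGIN {
        x = 0
        while (x <= 20) {
            if (x == 5) {
                x++          # advance first, so the loop cannot stall
                continue
            }
            printf "%d ", x
            x++
        }
        print ""
    }'
}

skip_five    # prints 0 through 20 on one line, omitting 5
```

The increment now appears twice, which is exactly the bookkeeping that
the @code{for} statement handles automatically.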

@c @cindex @code{continue}, outside of loops
@c @cindex historical features
@c @cindex @command{awk} language, POSIX version
@cindex POSIX @command{awk}, @code{continue} statement and
@cindex dark corner, @code{continue} statement
@cindex @command{gawk}, @code{continue} statement in
The @code{continue} statement has no meaning when used outside the body of
a loop. Historical versions of @command{awk} treated a @code{continue}
statement outside a loop the same way they treated a @code{break}
statement outside a loop: as if it were a @code{next}
statement
(@pxref{Next Statement}).
Recent versions of Unix @command{awk} no longer work this way, and
@command{gawk} allows it only if @option{--traditional} is specified on
the command line (@pxref{Options}). Just like the
@code{break} statement, the POSIX standard specifies that @code{continue}
should only be used inside the body of a loop.
@value{DARKCORNER}

@node Next Statement
@subsection The @code{next} Statement
@cindex @code{next} statement

The @code{next} statement forces @command{awk} to immediately stop processing
the current record and go on to the next record. This means that no
further rules are executed for the current record, and the rest of the
current rule's action isn't executed.

Contrast this with the effect of the @code{getline} function
(@pxref{Getline}). That also causes
@command{awk} to read the next record immediately, but it does not alter the
flow of control in any way (i.e., the rest of the current action executes
with a new input record).

@cindex @command{awk} programs, execution of
At the highest level, @command{awk} program execution is a loop that reads
an input record and then tests each rule's pattern against it.
If you
think of this loop as a @code{for} statement whose body contains the
rules, then the @code{next} statement is analogous to a @code{continue}
statement. It skips to the end of the body of this implicit loop and
executes the increment (which reads another record).

For example, suppose an @command{awk} program works only on records
with four fields, and it shouldn't fail when given bad input. To avoid
complicating the rest of the program, write a ``weed out'' rule near
the beginning, in the following manner:

@example
NF != 4 @{
    err = sprintf("%s:%d: skipped: NF != 4\n", FILENAME, FNR)
    print err > "/dev/stderr"
    next
@}
@end example

@noindent
Because of the @code{next} statement,
the program's subsequent rules won't see the bad record. The error
message is redirected to the standard error output stream, as error
messages should be.
For more detail see
@ref{Special Files}.

@c @cindex @command{awk} language, POSIX version
@c @cindex @code{next}, inside a user-defined function
@cindex @code{BEGIN} pattern, @code{next}/@code{nextfile} statements and
@cindex @code{END} pattern, @code{next}/@code{nextfile} statements and
@cindex POSIX @command{awk}, @code{next}/@code{nextfile} statements and
@cindex @code{next} statement, user-defined functions and
@cindex functions, user-defined, @code{next}/@code{nextfile} statements and
According to the POSIX standard, the behavior is undefined if
the @code{next} statement is used in a @code{BEGIN} or @code{END} rule.
@command{gawk} treats it as a syntax error.
Although POSIX permits it,
some other @command{awk} implementations don't allow the @code{next}
statement inside function bodies
(@pxref{User-defined}).
Just as with any other @code{next} statement, a @code{next} statement inside a
function body reads the next record and starts processing it with the
first rule in the program.
If the @code{next} statement causes the end of the input to be reached,
then the code in any @code{END} rules is executed.
@xref{BEGIN/END}.

@node Nextfile Statement
@subsection Using @command{gawk}'s @code{nextfile} Statement
@cindex @code{nextfile} statement
@cindex differences in @command{awk} and @command{gawk}, @code{next}/@code{nextfile} statements

@command{gawk} provides the @code{nextfile} statement,
which is similar to the @code{next} statement.
However, instead of abandoning processing of the current record, the
@code{nextfile} statement instructs @command{gawk} to stop processing the
current @value{DF}.

The @code{nextfile} statement is a @command{gawk} extension.
In most other @command{awk} implementations,
or if @command{gawk} is in compatibility mode
(@pxref{Options}),
@code{nextfile} is not special.

Upon execution of the @code{nextfile} statement, @code{FILENAME} is
updated to the name of the next @value{DF} listed on the command line,
@code{FNR} is reset to one, @code{ARGIND} is incremented, and processing
starts over with the first rule in the program.
(@code{ARGIND} hasn't been introduced yet. @xref{Built-in Variables}.)
If the @code{nextfile} statement causes the end of the input to be reached,
then the code in any @code{END} rules is executed.
@xref{BEGIN/END}.

The @code{nextfile} statement is useful when there are many @value{DF}s
to process but it isn't necessary to process every record in every file.
Normally, in order to move on to the next @value{DF}, a program
has to continue scanning the unwanted records. The @code{nextfile}
statement accomplishes this much more efficiently.
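
As a sketch of this use, the following prints only the first record of
each file and then moves straight on to the next one. It assumes an
@command{awk} that implements @code{nextfile} (@command{gawk} does); the
function name @code{first_lines} and the file names in the usage comment
are hypothetical.

```shell
# Hypothetical helper: print "FILENAME: first record" for each file.
# Usage: first_lines file1 file2 ...
first_lines() {
    # The rule fires on the first record of every file; nextfile then
    # abandons the rest of that file instead of scanning it.
    awk 'FNR == 1 { print FILENAME ": " $0; nextfile }' "$@"
}
```

Without @code{nextfile}, the rule would produce the same output, but
@command{awk} would still read every remaining record of each file just
to discard it.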

While one might think that @samp{close(FILENAME)} would accomplish
the same as @code{nextfile}, this isn't true. @code{close} is
reserved for closing files, pipes, and coprocesses that are
opened with redirections. It is not related to the main processing that
@command{awk} does with the files listed in @code{ARGV}.

If it's necessary to use an @command{awk} version that doesn't support
@code{nextfile}, see
@ref{Nextfile Function},
for a user-defined function that simulates the @code{nextfile}
statement.

@cindex functions, user-defined, @code{next}/@code{nextfile} statements and
@cindex @code{nextfile} statement, user-defined functions and
The current version of the Bell Laboratories @command{awk}
(@pxref{Other Versions})
also supports @code{nextfile}. However, it doesn't allow the @code{nextfile}
statement inside function bodies
(@pxref{User-defined}).
@command{gawk} does; a @code{nextfile} inside a
function body reads the first record of the next file and starts
processing it with the first rule in the program, just as with any other
@code{nextfile} statement.

@cindex @code{next file} statement, in @command{gawk}
@cindex @command{gawk}, @code{next file} statement in
@cindex @code{nextfile} statement, in @command{gawk}
@cindex @command{gawk}, @code{nextfile} statement in
@strong{Caution:} Versions of @command{gawk} prior to 3.0 used two
words (@samp{next file}) for the @code{nextfile} statement.
In @value{PVERSION} 3.0, this was changed
to one word, because the treatment of @samp{file} was
inconsistent. When it appeared after @code{next}, @samp{file} was a keyword;
otherwise, it was a regular identifier. The old usage is no longer
accepted; @samp{next file} generates a syntax error.

@node Exit Statement
@subsection The @code{exit} Statement

@cindex @code{exit} statement
The @code{exit} statement causes @command{awk} to immediately stop
executing the current rule and to stop processing input; any remaining input
is ignored. The @code{exit} statement is written as follows:

@example
exit @r{[}@var{return code}@r{]}
@end example

@cindex @code{BEGIN} pattern, @code{exit} statement and
@cindex @code{END} pattern, @code{exit} statement and
When an @code{exit} statement is executed from a @code{BEGIN} rule, the
program stops processing everything immediately. No input records are
read. However, if an @code{END} rule is present,
as part of executing the @code{exit} statement,
the @code{END} rule is executed
(@pxref{BEGIN/END}).
If @code{exit} is used as part of an @code{END} rule, it causes
the program to stop immediately.

An @code{exit} statement that is not part of a @code{BEGIN} or @code{END}
rule stops the execution of any further automatic rules for the current
record, skips reading any remaining input records, and executes the
@code{END} rule if there is one.

In such a case,
if you don't want the @code{END} rule to do its job, set a variable
to nonzero before the @code{exit} statement and check that variable in
the @code{END} rule.
@xref{Assert Function},
for an example that does this.

@cindex dark corner, @code{exit} statement
If an argument is supplied to @code{exit}, its value is used as the exit
status code for the @command{awk} process. If no argument is supplied,
@code{exit} returns status zero (success). In the case where an argument
is supplied to a first @code{exit} statement, and then @code{exit} is
called a second time from an @code{END} rule with no argument,
@command{awk} uses the previously supplied exit value.
@value{DARKCORNER}

@cindex programming conventions, @code{exit} statement
For example, suppose an error condition occurs that is difficult or
impossible to handle. Conventionally, programs report this by
exiting with a nonzero status. An @command{awk} program can do this
using an @code{exit} statement with a nonzero argument, as shown
in the following example:

@example
BEGIN @{
    if (("date" | getline date_now) <= 0) @{
        print "Can't get system date" > "/dev/stderr"
        exit 1
    @}
    print "current date is", date_now
    close("date")
@}
@end example
@c ENDOFRANGE csta
@c ENDOFRANGE acs
@c ENDOFRANGE accs

@node Built-in Variables
@section Built-in Variables
@c STARTOFRANGE bvar
@cindex built-in variables
@c STARTOFRANGE varb
@cindex variables, built-in

Most @command{awk} variables are available to use for your own
purposes; they never change unless your program assigns values to
them, and they never affect anything unless your program examines them.
However, a few variables in @command{awk} have special built-in meanings.
@command{awk} examines some of these automatically, so that they enable you
to tell @command{awk} how to do certain things. Others are set
automatically by @command{awk}, so that they carry information from the
internal workings of @command{awk} to your program.

@cindex @command{gawk}, built-in variables and
This @value{SECTION} documents all the built-in variables of
@command{gawk}, most of which are also documented in the chapters
describing their areas of activity.

@menu
* User-modified::               Built-in variables that you change to control
                                @command{awk}.
* Auto-set::                    Built-in variables where @command{awk} gives
                                you information.
* ARGC and ARGV::               Ways to use @code{ARGC} and @code{ARGV}.
@end menu

@node User-modified
@subsection Built-in Variables That Control @command{awk}
@c STARTOFRANGE bvaru
@cindex built-in variables, user-modifiable
@c STARTOFRANGE nmbv
@cindex user-modifiable variables

The following is an alphabetical list of variables that you can change to
control how @command{awk} does certain things. The variables that are
specific to @command{gawk} are marked with a pound sign@w{ (@samp{#}).}

@table @code
@cindex @code{BINMODE} variable
@cindex binary input/output
@cindex input/output, binary
@item BINMODE #
On non-POSIX systems, this variable specifies use of binary mode for all I/O.
Numeric values of one, two, or three specify that input files, output files, or
all files, respectively, should use binary I/O.
Alternatively,
string values of @code{"r"} or @code{"w"} specify that input files and
output files, respectively, should use binary I/O.
A string value of @code{"rw"} or @code{"wr"} indicates that all
files should use binary I/O.
Any other string value is equivalent to @code{"rw"}, but @command{gawk}
generates a warning message.
@code{BINMODE} is described in more detail in
@ref{PC Using}.

@cindex differences in @command{awk} and @command{gawk}, @code{BINMODE} variable
This variable is a @command{gawk} extension.
In other @command{awk} implementations
(except @command{mawk},
@pxref{Other Versions}),
or if @command{gawk} is in compatibility mode
(@pxref{Options}),
it is not special.

@cindex @code{CONVFMT} variable
@cindex POSIX @command{awk}, @code{CONVFMT} variable and
@cindex numbers, converting, to strings
@cindex strings, converting, numbers to
@item CONVFMT
This string controls conversion of numbers to
strings (@pxref{Conversion}).
It works by being passed, in effect, as the first argument to the
@code{sprintf} function
(@pxref{String Functions}).
Its default value is @code{"%.6g"}.
@code{CONVFMT} was introduced by the POSIX standard.

@cindex @code{FIELDWIDTHS} variable
@cindex differences in @command{awk} and @command{gawk}, @code{FIELDWIDTHS} variable
@cindex field separators, @code{FIELDWIDTHS} variable and
@cindex separators, field, @code{FIELDWIDTHS} variable and
@item FIELDWIDTHS #
This is a space-separated list of columns that tells @command{gawk}
how to split input with fixed columnar boundaries.
Assigning a value to @code{FIELDWIDTHS}
overrides the use of @code{FS} for field splitting.
@xref{Constant Size}, for more information.

@cindex @command{gawk}, @code{FIELDWIDTHS} variable in
If @command{gawk} is in compatibility mode
(@pxref{Options}), then @code{FIELDWIDTHS}
has no special meaning, and field-splitting operations occur based
exclusively on the value of @code{FS}.

@cindex @code{FS} variable
@cindex separators, field
@cindex field separators
@item FS
This is the input field separator
(@pxref{Field Separators}).
The value is a single-character string or a multi-character regular
expression that matches the separations between fields in an input
record. If the value is the null string (@code{""}), then each
character in the record becomes a separate field.
(This behavior is a @command{gawk} extension. POSIX @command{awk} does not
specify the behavior when @code{FS} is the null string.)
@c NEXT ED: Mark as common extension

@cindex POSIX @command{awk}, @code{FS} variable and
The default value is @w{@code{" "}}, a string consisting of a single
space.
As a special exception, this value means that any
sequence of spaces, tabs, and/or newlines is a single separator.@footnote{In
POSIX @command{awk}, newline does not count as whitespace.} It also causes
spaces, tabs, and newlines at the beginning and end of a record to be ignored.

You can set the value of @code{FS} on the command line using the
@option{-F} option:

@example
awk -F, '@var{program}' @var{input-files}
@end example

@cindex @command{gawk}, field separators and
If @command{gawk} is using @code{FIELDWIDTHS} for field splitting,
assigning a value to @code{FS} causes @command{gawk} to return to
the normal, @code{FS}-based field splitting. An easy way to do this
is to simply say @samp{FS = FS}, perhaps with an explanatory comment.

@cindex @code{IGNORECASE} variable
@cindex differences in @command{awk} and @command{gawk}, @code{IGNORECASE} variable
@cindex case sensitivity, string comparisons and
@cindex case sensitivity, regexps and
@cindex regular expressions, case sensitivity
@item IGNORECASE #
If @code{IGNORECASE} is nonzero or non-null, then all string comparisons
and all regular expression matching are case independent. Thus, regexp
matching with @samp{~} and @samp{!~}, as well as the @code{gensub},
@code{gsub}, @code{index}, @code{match}, @code{split}, and @code{sub}
functions, record termination with @code{RS}, and field splitting with
@code{FS}, all ignore case when doing their particular regexp operations.
However, the value of @code{IGNORECASE} does @emph{not} affect array subscripting
and it does not affect field splitting when using a single-character
field separator.
@xref{Case-sensitivity}.

@cindex @command{gawk}, @code{IGNORECASE} variable in
If @command{gawk} is in compatibility mode
(@pxref{Options}),
then @code{IGNORECASE} has no special meaning.
Thus, string
and regexp operations are always case-sensitive.

@cindex @code{LINT} variable
@cindex differences in @command{awk} and @command{gawk}, @code{LINT} variable
@cindex lint checking
@item LINT #
When this variable is true (nonzero or non-null), @command{gawk}
behaves as if the @option{--lint} command-line option is in effect
(@pxref{Options}).
With a value of @code{"fatal"}, lint warnings become fatal errors.
With a value of @code{"invalid"}, only warnings about things that are
actually invalid are issued. (This is not fully implemented yet.)
Any other true value prints nonfatal warnings.
Assigning a false value to @code{LINT} turns off the lint warnings.

@cindex @command{gawk}, @code{LINT} variable in
This variable is a @command{gawk} extension. It is not special
in other @command{awk} implementations. Unlike the other special variables,
changing @code{LINT} does affect the production of lint warnings,
even if @command{gawk} is in compatibility mode. Much as
the @option{--lint} and @option{--traditional} options independently
control different aspects of @command{gawk}'s behavior, the control
of lint warnings during program execution is independent of the flavor
of @command{awk} being executed.

@cindex @code{OFMT} variable
@cindex numbers, converting, to strings
@cindex strings, converting, numbers to
@item OFMT
This string controls conversion of numbers to
strings (@pxref{Conversion}) for
printing with the @code{print} statement. It works by being passed
as the first argument to the @code{sprintf} function
(@pxref{String Functions}).
Its default value is @code{"%.6g"}. Earlier versions of @command{awk}
also used @code{OFMT} to specify the format for converting numbers to
strings in general expressions; this is now done by @code{CONVFMT}.
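
As a rough illustration of how the work is divided between the two
variables, the following sketch gives @code{OFMT} and @code{CONVFMT}
different values (both formats are arbitrary, chosen only for this
example) and shows which one applies in each situation:

@example
$ awk 'BEGIN @{
>     OFMT = "%.2f"; CONVFMT = "%.3f"
>     x = 3.14159
>     print x          # a number printed directly uses OFMT
>     print (x "")     # concatenation converts with CONVFMT
> @}'
@print{} 3.14
@print{} 3.142
@end example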

@cindex @code{sprintf} function, @code{OFMT} variable and
@cindex @code{print} statement, @code{OFMT} variable and
@cindex @code{OFS} variable
@cindex separators, field
@cindex field separators
@item OFS
This is the output field separator (@pxref{Output Separators}). It is
output between the fields printed by a @code{print} statement. Its
default value is @w{@code{" "}}, a string consisting of a single space.

@cindex @code{ORS} variable
@item ORS
This is the output record separator. It is output at the end of every
@code{print} statement. Its default value is @code{"\n"}, the newline
character. (@xref{Output Separators}.)

@cindex @code{RS} variable
@cindex separators, record
@cindex record separators
@item RS
This is @command{awk}'s input record separator. Its default value is a string
containing a single newline character, which means that an input record
consists of a single line of text.
It can also be the null string, in which case records are separated by
runs of blank lines.
If it is a regexp, records are separated by
matches of the regexp in the input text.
(@xref{Records}.)

The ability for @code{RS} to be a regular expression
is a @command{gawk} extension.
In most other @command{awk} implementations,
or if @command{gawk} is in compatibility mode
(@pxref{Options}),
just the first character of @code{RS}'s value is used.

@cindex @code{SUBSEP} variable
@cindex separators, subscript
@cindex subscript separators
@item SUBSEP
This is the subscript separator. It has the default value of
@code{"\034"} and is used to separate the parts of the indices of a
multidimensional array. Thus, the expression @code{@w{foo["A", "B"]}}
really accesses @code{foo["A\034B"]}
(@pxref{Multi-dimensional}).
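
A small sketch of this equivalence follows (the array name @code{foo}
and the stored value are only for illustration). The element stored
with two subscripts has a single combined index, which can be taken
apart again by splitting on @code{SUBSEP}:

@example
$ awk 'BEGIN @{
>     foo["A", "B"] = 1
>     for (k in foo) @{
>         n = split(k, part, SUBSEP)
>         print n, part[1], part[2]
>     @}
>     if (("A" SUBSEP "B") in foo)
>         print "same element"
> @}'
@print{} 2 A B
@print{} same element
@end example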

@cindex @code{TEXTDOMAIN} variable
@cindex differences in @command{awk} and @command{gawk}, @code{TEXTDOMAIN} variable
@cindex internationalization, localization
@item TEXTDOMAIN #
This variable is used for internationalization of programs at the
@command{awk} level. It sets the default text domain for specially
marked string constants in the source text, as well as for the
@code{dcgettext}, @code{dcngettext} and @code{bindtextdomain} functions
(@pxref{Internationalization}).
The default value of @code{TEXTDOMAIN} is @code{"messages"}.

This variable is a @command{gawk} extension.
In other @command{awk} implementations,
or if @command{gawk} is in compatibility mode
(@pxref{Options}),
it is not special.
@end table
@c ENDOFRANGE bvar
@c ENDOFRANGE varb
@c ENDOFRANGE bvaru
@c ENDOFRANGE nmbv

@node Auto-set
@subsection Built-in Variables That Convey Information

@c STARTOFRANGE bvconi
@cindex built-in variables, conveying information
@c STARTOFRANGE vbconi
@cindex variables, built-in, conveying information
The following is an alphabetical list of variables that @command{awk}
sets automatically on certain occasions in order to provide
information to your program. The variables that are specific to
@command{gawk} are marked with a pound sign@w{ (@samp{#}).}

@table @code
@cindex @code{ARGC}/@code{ARGV} variables
@cindex arguments, command-line
@cindex command line, arguments
@item ARGC@r{,} ARGV
The command-line arguments available to @command{awk} programs are stored in
an array called @code{ARGV}. @code{ARGC} is the number of command-line
arguments present. @xref{Other Arguments}.
Unlike most @command{awk} arrays,
@code{ARGV} is indexed from 0 to @code{ARGC} @minus{} 1.
In the following example:

@example
$ awk 'BEGIN @{
>         for (i = 0; i < ARGC; i++)
>             print ARGV[i]
>      @}' inventory-shipped BBS-list
@print{} awk
@print{} inventory-shipped
@print{} BBS-list
@end example

@noindent
@code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
@code{"BBS-list"}. The value of @code{ARGC} is three, one more than the
index of the last element in @code{ARGV}, because the elements are numbered
from zero.

@cindex programming conventions, @code{ARGC}/@code{ARGV} variables
The names @code{ARGC} and @code{ARGV}, as well as the convention of indexing
the array from 0 to @code{ARGC} @minus{} 1, are derived from the C language's
method of accessing command-line arguments.

The value of @code{ARGV[0]} can vary from system to system.
Also, you should note that the program text is @emph{not} included in
@code{ARGV}, nor are any of @command{awk}'s command-line options.
@xref{ARGC and ARGV}, for information
about how @command{awk} uses these variables.

@cindex @code{ARGIND} variable
@cindex differences in @command{awk} and @command{gawk}, @code{ARGIND} variable
@item ARGIND #
The index in @code{ARGV} of the current file being processed.
Every time @command{gawk} opens a new @value{DF} for processing, it sets
@code{ARGIND} to the index in @code{ARGV} of the @value{FN}.
When @command{gawk} is processing the input files,
@samp{FILENAME == ARGV[ARGIND]} is always true.

@c comma before ARGIND does NOT mark a tertiary
@cindex files, processing, @code{ARGIND} variable and
This variable is useful in file processing; it allows you to tell how far
along you are in the list of @value{DF}s as well as to distinguish between
successive instances of the same @value{FN} on the command line.

@cindex @value{FN}s, distinguishing
While you can change the value of @code{ARGIND} within your @command{awk}
program, @command{gawk} automatically sets it to a new value when the
next file is opened.

This variable is a @command{gawk} extension.
In other @command{awk} implementations,
or if @command{gawk} is in compatibility mode
(@pxref{Options}),
it is not special.

@cindex @code{ENVIRON} variable
@cindex environment variables
@item ENVIRON
An associative array that contains the values of the environment. The array
indices are the environment variable names; the elements are the values of
the particular environment variables. For example,
@code{ENVIRON["HOME"]} might be @file{/home/arnold}. Changing this array
does not affect the environment passed on to any programs that
@command{awk} may spawn via redirection or the @code{system} function.
@c (In a future version of @command{gawk}, it may do so.)

Some operating systems may not have environment variables.
On such systems, the @code{ENVIRON} array is empty (except for
@w{@code{ENVIRON["AWKPATH"]}},
@pxref{AWKPATH Variable}).

@cindex @code{ERRNO} variable
@cindex differences in @command{awk} and @command{gawk}, @code{ERRNO} variable
@cindex error handling, @code{ERRNO} variable and
@item ERRNO #
If a system error occurs during a redirection for @code{getline},
during a read for @code{getline}, or during a @code{close} operation,
then @code{ERRNO} contains a string describing the error.

This variable is a @command{gawk} extension.
In other @command{awk} implementations,
or if @command{gawk} is in compatibility mode
(@pxref{Options}),
it is not special.
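
The following fragment is a minimal sketch of one way to report such an
error with @command{gawk}. The @value{FN} is hypothetical and is simply
assumed not to exist; the text of the message in @code{ERRNO} comes from
the system and varies with platform and locale, so no output is shown:

@example
BEGIN @{
    file = "/no/such/file"     # hypothetical name, assumed not to exist
    if ((getline line < file) < 0)
        print "cannot read " file ": " ERRNO > "/dev/stderr"
@}
@end example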

@cindex @code{FILENAME} variable
@cindex dark corner, @code{FILENAME} variable
@item FILENAME
The name of the file that @command{awk} is currently reading.
When no @value{DF}s are listed on the command line, @command{awk} reads
from the standard input and @code{FILENAME} is set to @code{"-"}.
@code{FILENAME} is changed each time a new file is read
(@pxref{Reading Files}).
Inside a @code{BEGIN} rule, the value of @code{FILENAME} is
@code{""}, since there are no input files being processed
yet.@footnote{Some early implementations of Unix @command{awk} initialized
@code{FILENAME} to @code{"-"}, even if there were @value{DF}s to be
processed. This behavior was incorrect and should not be relied
upon in your programs.}
@value{DARKCORNER}
Note, though, that using @code{getline}
(@pxref{Getline})
inside a @code{BEGIN} rule can give
@code{FILENAME} a value.

@cindex @code{FNR} variable
@item FNR
The current record number in the current file. @code{FNR} is
incremented each time a new record is read
(@pxref{Getline}). It is reinitialized
to zero each time a new input file is started.

@cindex @code{NF} variable
@item NF
The number of fields in the current input record.
@code{NF} is set each time a new record is read, when a new field is
created, or when @code{$0} changes (@pxref{Fields}).

Unlike most of the variables described in this
@ifnotinfo
section,
@end ifnotinfo
@ifinfo
node,
@end ifinfo
assigning a value to @code{NF} has the potential to affect
@command{awk}'s internal workings. In particular, assignments
to @code{NF} can be used to create or remove fields from the
current record. @xref{Changing Fields}.
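
As a brief sketch of both directions (the input text is made up for
this example), lowering @code{NF} discards the trailing fields and
rebuilds @code{$0}, while raising it creates empty fields that can then
be assigned:

@example
$ echo 'a b c d' | awk '@{ NF = 2; print; print NF @}'
@print{} a b
@print{} 2
$ echo 'a b' | awk '@{ NF = 4; $4 = "d"; print @}'
@print{} a b  d
@end example

@noindent
In the second command, the third field exists but is empty, which is
why two separators appear between @samp{b} and @samp{d}.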

@cindex @code{NR} variable
@item NR
The number of input records @command{awk} has processed since
the beginning of the program's execution
(@pxref{Records}).
@code{NR} is incremented each time a new record is read.

@cindex @code{PROCINFO} array
@cindex differences in @command{awk} and @command{gawk}, @code{PROCINFO} array
@item PROCINFO #
The elements of this array provide access to information about the
running @command{awk} program.
The following elements (listed alphabetically)
are guaranteed to be available:

@table @code
@item PROCINFO["egid"]
The value of the @code{getegid} system call.

@item PROCINFO["euid"]
The value of the @code{geteuid} system call.

@item PROCINFO["FS"]
This is
@code{"FS"} if field splitting with @code{FS} is in effect, or it is
@code{"FIELDWIDTHS"} if field splitting with @code{FIELDWIDTHS} is in effect.

@item PROCINFO["gid"]
The value of the @code{getgid} system call.

@item PROCINFO["pgrpid"]
The process group ID of the current process.

@item PROCINFO["pid"]
The process ID of the current process.

@item PROCINFO["ppid"]
The parent process ID of the current process.

@item PROCINFO["uid"]
The value of the @code{getuid} system call.
@end table

On some systems, there may be elements in the array, @code{"group1"}
through @code{"group@var{N}"} for some @var{N}. @var{N} is the number of
supplementary groups that the process has. Use the @code{in} operator
to test for these elements
(@pxref{Reference to Elements}).

This array is a @command{gawk} extension.
In other @command{awk} implementations,
or if @command{gawk} is in compatibility mode
(@pxref{Options}),
it is not special.
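
As a small sketch, assuming @command{gawk} is not in compatibility
mode, the @code{"FS"} element can be watched as the field-splitting
method changes (the widths @samp{3 2} are arbitrary):

@example
$ gawk 'BEGIN @{
>     print PROCINFO["FS"]
>     FIELDWIDTHS = "3 2"
>     print PROCINFO["FS"]
>     FS = FS              # return to FS-based splitting
>     print PROCINFO["FS"]
> @}'
@print{} FS
@print{} FIELDWIDTHS
@print{} FS
@end example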

@cindex @code{RLENGTH} variable
@item RLENGTH
The length of the substring matched by the
@code{match} function
(@pxref{String Functions}).
@code{RLENGTH} is set by invoking the @code{match} function. Its value
is the length of the matched string, or @minus{}1 if no match is found.

@cindex @code{RSTART} variable
@item RSTART
The start-index in characters of the substring that is matched by the
@code{match} function
(@pxref{String Functions}).
@code{RSTART} is set by invoking the @code{match} function. Its value
is the position of the string where the matched substring starts, or zero
if no match was found.

@cindex @code{RT} variable
@cindex differences in @command{awk} and @command{gawk}, @code{RT} variable
@item RT #
This is set each time a record is read. It contains the input text
that matched the text denoted by @code{RS}, the record separator.

This variable is a @command{gawk} extension.
In other @command{awk} implementations,
or if @command{gawk} is in compatibility mode
(@pxref{Options}),
it is not special.
@end table
@c ENDOFRANGE bvconi
@c ENDOFRANGE vbconi

@c fakenode --- for prepinfo
@subheading Advanced Notes: Changing @code{NR} and @code{FNR}
@cindex @code{NR} variable, changing
@cindex @code{FNR} variable, changing
@cindex advanced features, @code{FNR}/@code{NR} variables
@cindex dark corner, @code{FNR}/@code{NR} variables
@command{awk} increments @code{NR} and @code{FNR}
each time it reads a record, instead of setting them to the absolute
value of the number of records read. This means that a program can
change these variables and their new values are incremented for
each record.
@value{DARKCORNER}
This is demonstrated in the following example:

@example
$ echo '1
> 2
> 3
> 4' | awk 'NR == 2 @{ NR = 17 @}
> @{ print NR @}'
@print{} 1
@print{} 17
@print{} 18
@print{} 19
@end example

@noindent
Before @code{FNR} was added to the @command{awk} language
(@pxref{V7/SVR3.1}),
many @command{awk} programs used this feature to track the number of
records in a file by resetting @code{NR} to zero when @code{FILENAME}
changed.

@node ARGC and ARGV
@subsection Using @code{ARGC} and @code{ARGV}
@cindex @code{ARGC}/@code{ARGV} variables
@cindex arguments, command-line
@cindex command line, arguments

@ref{Auto-set},
presented the following program describing the information contained in @code{ARGC}
and @code{ARGV}:

@example
$ awk 'BEGIN @{
>        for (i = 0; i < ARGC; i++)
>            print ARGV[i]
>      @}' inventory-shipped BBS-list
@print{} awk
@print{} inventory-shipped
@print{} BBS-list
@end example

@noindent
In this example, @code{ARGV[0]} contains @samp{awk}, @code{ARGV[1]}
contains @samp{inventory-shipped}, and @code{ARGV[2]} contains
@samp{BBS-list}.
Notice that the @command{awk} program is not entered in @code{ARGV}. The
other special command-line options, with their arguments, are also not
entered. This includes variable assignments done with the @option{-v}
option (@pxref{Options}).
Normal variable assignments on the command line @emph{are}
treated as arguments and do show up in the @code{ARGV} array:

@example
$ cat showargs.awk
@print{} BEGIN @{
@print{}     printf "A=%d, B=%d\n", A, B
@print{}     for (i = 0; i < ARGC; i++)
@print{}         printf "\tARGV[%d] = %s\n", i, ARGV[i]
@print{} @}
@print{} END   @{ printf "A=%d, B=%d\n", A, B @}
$ awk -v A=1 -f showargs.awk B=2 /dev/null
@print{} A=1, B=0
@print{}        ARGV[0] = awk
@print{}        ARGV[1] = B=2
@print{}        ARGV[2] = /dev/null
@print{} A=1, B=2
@end example

A program can alter @code{ARGC} and the elements of @code{ARGV}.
Each time @command{awk} reaches the end of an input file, it uses the next
element of @code{ARGV} as the name of the next input file. By storing a
different string there, a program can change which files are read.
Use @code{"-"} to represent the standard input. Storing
additional elements and incrementing @code{ARGC} causes
additional files to be read.

If the value of @code{ARGC} is decreased, that eliminates input files
from the end of the list. By recording the old value of @code{ARGC}
elsewhere, a program can treat the eliminated arguments as
something other than @value{FN}s.

To eliminate a file from the middle of the list, store the null string
(@code{""}) into @code{ARGV} in place of the file's name. As a
special feature, @command{awk} ignores @value{FN}s that have been
replaced with the null string.
Another option is to
use the @code{delete} statement to remove elements from
@code{ARGV} (@pxref{Delete}).

All of these actions are typically done in the @code{BEGIN} rule,
before actual processing of the input begins.
@xref{Split Program}, and see
@ref{Tee Program}, for examples
of each way of removing elements from @code{ARGV}.
The following fragment processes @code{ARGV} in order to examine, and
then remove, command-line options:
@c NEXT ED: Add xref to rewind() function

@example
BEGIN @{
    for (i = 1; i < ARGC; i++) @{
        if (ARGV[i] == "-v")
            verbose = 1
        else if (ARGV[i] == "-d")
            debug = 1
        else if (ARGV[i] ~ /^-./) @{
            # any other argument starting with a dash is an error;
            # report the character that follows the dash
            e = sprintf("%s: unrecognized option -- %c",
                    ARGV[0], substr(ARGV[i], 2, 1))
            print e > "/dev/stderr"
        @} else
            break
        delete ARGV[i]
    @}
@}
@end example

To actually get the options into the @command{awk} program,
end the @command{awk} options with @option{--} and then supply
the @command{awk} program's options, in the following manner:

@example
awk -f myprog -- -v -d file1 file2 @dots{}
@end example

@cindex differences in @command{awk} and @command{gawk}, @code{ARGC}/@code{ARGV} variables
This is not necessary in @command{gawk}. Unless @option{--posix} has
been specified, @command{gawk} silently puts any unrecognized options
into @code{ARGV} for the @command{awk} program to deal with. As soon
as it sees an unknown option, @command{gawk} stops looking for other
options that it might otherwise recognize. The previous example with
@command{gawk} would be:

@example
gawk -f myprog -d -v file1 file2 @dots{}
@end example

@noindent
Because @option{-d} is not a valid @command{gawk} option,
it and the following @option{-v}
are passed on to the @command{awk} program.

@node Arrays
@chapter Arrays in @command{awk}
@c STARTOFRANGE arrs
@cindex arrays

An @dfn{array} is a table of values called @dfn{elements}. The
elements of an array are distinguished by their indices. @dfn{Indices}
may be either numbers or strings.

This @value{CHAPTER} describes how arrays work in @command{awk},
how to use array elements, how to scan through every element in an array,
and how to remove array elements.
It also describes how @command{awk} simulates multidimensional
arrays, as well as some of the less obvious points about array usage.
The @value{CHAPTER} finishes with a discussion of @command{gawk}'s facility
for sorting an array based on its indices.

@cindex variables, names of
@cindex functions, names of
@cindex arrays, names of
@cindex names, arrays/variables
@cindex namespace issues
@command{awk} maintains a single set
of names that may be used for naming variables, arrays, and functions
(@pxref{User-defined}).
Thus, you cannot have a variable and an array with the same name in the
same @command{awk} program.

@menu
* Array Intro::                 Introduction to Arrays
* Reference to Elements::       How to examine one element of an array.
* Assigning Elements::          How to change an element of an array.
* Array Example::               Basic Example of an Array
* Scanning an Array::           A variation of the @code{for} statement. It
                                loops through the indices of an array's
                                existing elements.
* Delete::                      The @code{delete} statement removes an element
                                from an array.
* Numeric Array Subscripts::    How to use numbers as subscripts in
                                @command{awk}.
* Uninitialized Subscripts::    Using Uninitialized variables as subscripts.
* Multi-dimensional::           Emulating multidimensional arrays in
                                @command{awk}.
* Multi-scanning::              Scanning multidimensional arrays.
* Array Sorting::               Sorting array values and indices.
@end menu

@node Array Intro
@section Introduction to Arrays

The @command{awk} language provides one-dimensional arrays
for storing groups of related strings or numbers.
Every @command{awk} array must have a name.
Array names have the same
syntax as variable names; any valid variable name would also be a valid
array name. But one name cannot be used in both ways (as an array and
as a variable) in the same @command{awk} program.

Arrays in @command{awk} superficially resemble arrays in other programming
languages, but there are fundamental differences. In @command{awk}, it
isn't necessary to specify the size of an array before starting to use it.
Additionally, any number or string in @command{awk}, not just consecutive integers,
may be used as an array index.

In most other languages, arrays must be @dfn{declared} before use,
including a specification of
how many elements or components they contain. In such languages, the
declaration causes a contiguous block of memory to be allocated for that
many elements. Usually, an index in the array must be a positive integer.
For example, the index zero specifies the first element in the array, which is
actually stored at the beginning of the block of memory. Index one
specifies the second element, which is stored in memory right after the
first element, and so on. It is impossible to add more elements to the
array, because it has room only for as many elements as given in
the declaration.
(Some languages allow arbitrary starting and ending
indices---e.g., @samp{15 .. 27}---but the size of the array is still fixed when
the array is declared.)

A contiguous array of four elements might look like the following example,
conceptually, if the element values are 8, @code{"foo"},
@code{""}, and 30:

@c NEXT ED: Use real images here
@iftex
@c from Karl Berry, much thanks for the help.
@tex
\bigskip % space above the table (about 1 linespace)
\offinterlineskip
\newdimen\width \width = 1.5cm
\newdimen\hwidth \hwidth = 4\width \advance\hwidth by 2pt % 5 * 0.4pt
\centerline{\vbox{
\halign{\strut\hfil\ignorespaces#&&\vrule#&\hbox to\width{\hfil#\unskip\hfil}\cr
\noalign{\hrule width\hwidth}
        &&{\tt 8} &&{\tt "foo"} &&{\tt ""} &&{\tt 30} &&\quad Value\cr
\noalign{\hrule width\hwidth}
\noalign{\smallskip}
        &\omit&0&\omit &1 &\omit&2 &\omit&3 &\omit&\quad Index\cr
}
}}
@end tex
@end iftex
@ifinfo
@example
+---------+---------+--------+---------+
|    8    |  "foo"  |   ""   |    30   |    @r{Value}
+---------+---------+--------+---------+
     0         1         2        3         @r{Index}
@end example
@end ifinfo
@ifxml
@example
+---------+---------+--------+---------+
|    8    |  "foo"  |   ""   |    30   |    @r{Value}
+---------+---------+--------+---------+
     0         1         2        3         @r{Index}
@end example
@end ifxml

@noindent
Only the values are stored; the indices are implicit from the order of
the values. Here, 8 is the value at index zero, because 8 appears in the
position with zero elements before it.

@c STARTOFRANGE arrin
@cindex arrays, indexing
@c STARTOFRANGE inarr
@cindex indexing arrays
@cindex associative arrays
@cindex arrays, associative
Arrays in @command{awk} are different---they are @dfn{associative}. This means
that each array is a collection of pairs: an index and its corresponding
array element value:

@example
@r{Element} 3     @r{Value} 30
@r{Element} 1     @r{Value} "foo"
@r{Element} 0     @r{Value} 8
@r{Element} 2     @r{Value} ""
@end example

@noindent
The pairs are shown in jumbled order because their order is irrelevant.

One advantage of associative arrays is that new pairs can be added
at any time.
For example, suppose a tenth element is added to the array 11358whose value is @w{@code{"number ten"}}. The result is: 11359 11360@example 11361@r{Element} 10 @r{Value} "number ten" 11362@r{Element} 3 @r{Value} 30 11363@r{Element} 1 @r{Value} "foo" 11364@r{Element} 0 @r{Value} 8 11365@r{Element} 2 @r{Value} "" 11366@end example 11367 11368@noindent 11369@cindex sparse arrays 11370@cindex arrays, sparse 11371Now the array is @dfn{sparse}, which just means some indices are missing. 11372It has elements 0--3 and 10, but doesn't have elements 4, 5, 6, 7, 8, or 9. 11373 11374Another consequence of associative arrays is that the indices don't 11375have to be positive integers. Any number, or even a string, can be 11376an index. For example, the following is an array that translates words from 11377English to French: 11378 11379@example 11380@r{Element} "dog" @r{Value} "chien" 11381@r{Element} "cat" @r{Value} "chat" 11382@r{Element} "one" @r{Value} "un" 11383@r{Element} 1 @r{Value} "un" 11384@end example 11385 11386@noindent 11387Here we decided to translate the number one in both spelled-out and 11388numeric form---thus illustrating that a single array can have both 11389numbers and strings as indices. 11390In fact, array subscripts are always strings; this is discussed 11391in more detail in 11392@ref{Numeric Array Subscripts}. 11393Here, the number @code{1} isn't double-quoted, since @command{awk} 11394automatically converts it to a string. 11395 11396@cindex case sensitivity, array indices and 11397@cindex arrays, @code{IGNORECASE} variable and 11398@cindex @code{IGNORECASE} variable, array subscripts and 11399The value of @code{IGNORECASE} has no effect upon array subscripting. 11400The identical string value used to store an array element must be used 11401to retrieve it. 11402When @command{awk} creates an array (e.g., with the @code{split} 11403built-in function), 11404that array's indices are consecutive integers starting at one. 11405(@xref{String Functions}.) 
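
As a small sketch of the two points above (assuming a POSIX shell and any
POSIX @command{awk}; the array names @code{animal} and @code{parts} are made
up for illustration), the following shows that subscripts differing only in
case name different elements, and that @code{split} numbers its elements
starting at one:

```shell
# Illustrative only: the array names animal and parts are hypothetical.
result=$(awk 'BEGIN {
    animal["dog"] = "chien"
    # "Dog" is a different string from "dog", so it is not a subscript.
    if (!("Dog" in animal))
        print "Dog is not a subscript"
    # split fills parts[1], parts[2], ... and returns the element count.
    n = split("a b c", parts)
    print n, parts[1], parts[3]
}')
echo "$result"
```

The test with @code{in} is used here instead of a reference to
@code{animal["Dog"]} so that the check itself does not create the element.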
11406 11407@command{awk}'s arrays are efficient---the time to access an element 11408is independent of the number of elements in the array. 11409@c ENDOFRANGE arrin 11410@c ENDOFRANGE inarr 11411 11412@node Reference to Elements 11413@section Referring to an Array Element 11414@cindex arrays, elements, referencing 11415@cindex elements in arrays 11416 11417The principal way to use an array is to refer to one of its elements. 11418An array reference is an expression as follows: 11419 11420@example 11421@var{array}[@var{index}] 11422@end example 11423 11424@noindent 11425Here, @var{array} is the name of an array. The expression @var{index} is 11426the index of the desired element of the array. 11427 11428The value of the array reference is the current value of that array 11429element. For example, @code{foo[4.3]} is an expression for the element 11430of array @code{foo} at index @samp{4.3}. 11431 11432A reference to an array element that has no recorded value yields a value of 11433@code{""}, the null string. This includes elements 11434that have not been assigned any value as well as elements that have been 11435deleted (@pxref{Delete}). Such a reference 11436automatically creates that array element, with the null string as its value. 11437(In some cases, this is unfortunate, because it might waste memory inside 11438@command{awk}.) 11439 11440@c @cindex arrays, @code{in} operator and 11441@cindex @code{in} operator, arrays and 11442To determine whether an element exists in an array at a certain index, use 11443the following expression: 11444 11445@example 11446@var{index} in @var{array} 11447@end example 11448 11449@cindex side effects, array indexing 11450@noindent 11451This expression tests whether the particular index exists, 11452without the side effect of creating that element if it is not present. 11453The expression has the value one (true) if @code{@var{array}[@var{index}]} 11454exists and zero (false) if it does not exist. 
11455For example, this statement tests whether the array @code{frequencies} 11456contains the index @samp{2}: 11457 11458@example 11459if (2 in frequencies) 11460 print "Subscript 2 is present." 11461@end example 11462 11463Note that this is @emph{not} a test of whether the array 11464@code{frequencies} contains an element whose @emph{value} is two. 11465There is no way to do that except to scan all the elements. Also, this 11466@emph{does not} create @code{frequencies[2]}, while the following 11467(incorrect) alternative does: 11468 11469@example 11470if (frequencies[2] != "") 11471 print "Subscript 2 is present." 11472@end example 11473 11474@node Assigning Elements 11475@section Assigning Array Elements 11476@cindex arrays, elements, assigning 11477@cindex elements in arrays, assigning 11478 11479Array elements can be assigned values just like 11480@command{awk} variables: 11481 11482@example 11483@var{array}[@var{subscript}] = @var{value} 11484@end example 11485 11486@noindent 11487@var{array} is the name of an array. The expression 11488@var{subscript} is the index of the element of the array that is 11489assigned a value. The expression @var{value} is the value to 11490assign to that element of the array. 11491 11492@node Array Example 11493@section Basic Array Example 11494 11495The following program takes a list of lines, each beginning with a line 11496number, and prints them out in order of line number. The line numbers 11497are not in order when they are first read---instead they 11498are scrambled. This program sorts the lines by making an array using 11499the line numbers as subscripts. The program then prints out the lines 11500in sorted order of their numbers. 
It is a very simple program and gets 11501confused upon encountering repeated numbers, gaps, or lines that don't 11502begin with a number: 11503 11504@example 11505@c file eg/misc/arraymax.awk 11506@{ 11507 if ($1 > max) 11508 max = $1 11509 arr[$1] = $0 11510@} 11511 11512END @{ 11513 for (x = 1; x <= max; x++) 11514 print arr[x] 11515@} 11516@c endfile 11517@end example 11518 11519The first rule keeps track of the largest line number seen so far; 11520it also stores each line into the array @code{arr}, at an index that 11521is the line's number. 11522The second rule runs after all the input has been read, to print out 11523all the lines. 11524When this program is run with the following input: 11525 11526@example 11527@c file eg/misc/arraymax.data 115285 I am the Five man 115292 Who are you? The new number two! 115304 . . . And four on the floor 115311 Who is number one? 115323 I three you. 11533@c endfile 11534@end example 11535 11536@noindent 11537Its output is: 11538 11539@example 115401 Who is number one? 115412 Who are you? The new number two! 115423 I three you. 115434 . . . And four on the floor 115445 I am the Five man 11545@end example 11546 11547If a line number is repeated, the last line with a given number overrides 11548the others. 11549Gaps in the line numbers can be handled with an easy improvement to the 11550program's @code{END} rule, as follows: 11551 11552@example 11553END @{ 11554 for (x = 1; x <= max; x++) 11555 if (x in arr) 11556 print arr[x] 11557@} 11558@end example 11559 11560@node Scanning an Array 11561@section Scanning All Elements of an Array 11562@cindex elements in arrays, scanning 11563@cindex arrays, scanning 11564 11565In programs that use arrays, it is often necessary to use a loop that 11566executes once for each element of an array. 
In other languages, where 11567arrays are contiguous and indices are limited to positive integers, 11568this is easy: all the valid indices can be found by counting from 11569the lowest index up to the highest. This technique won't do the job 11570in @command{awk}, because any number or string can be an array index. 11571So @command{awk} has a special kind of @code{for} statement for scanning 11572an array: 11573 11574@example 11575for (@var{var} in @var{array}) 11576 @var{body} 11577@end example 11578 11579@noindent 11580@cindex @code{in} operator, arrays and 11581This loop executes @var{body} once for each index in @var{array} that the 11582program has previously used, with the variable @var{var} set to that index. 11583 11584@cindex arrays, @code{for} statement and 11585@cindex @code{for} statement, in arrays 11586The following program uses this form of the @code{for} statement. The 11587first rule scans the input records and notes which words appear (at 11588least once) in the input, by storing a one into the array @code{used} with 11589the word as index. The second rule scans the elements of @code{used} to 11590find all the distinct words that appear in the input. It prints each 11591word that is more than 10 characters long and also prints the number of 11592such words. 11593@xref{String Functions}, 11594for more information on the built-in function @code{length}. 11595 11596@example 11597# Record a 1 for each word that is used at least once 11598@{ 11599 for (i = 1; i <= NF; i++) 11600 used[$i] = 1 11601@} 11602 11603# Find number of distinct words more than 10 characters long 11604END @{ 11605 for (x in used) 11606 if (length(x) > 10) @{ 11607 ++num_long_words 11608 print x 11609 @} 11610 print num_long_words, "words longer than 10 characters" 11611@} 11612@end example 11613 11614@noindent 11615@xref{Word Sorting}, 11616for a more detailed example of this type. 
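
The same scanning loop can also report a value for each index. The
following sketch counts how often each word occurs. The array name
@code{count} and the sample input are assumptions made for illustration,
and the output is piped through @command{sort} because the scanning order
is not defined:

```shell
# Hypothetical sample input; count is an illustrative array name.
result=$(printf 'a b a\nc a b\n' | awk '
{
    for (i = 1; i <= NF; i++)
        count[$i]++          # creates the element on first use
}
END {
    for (word in count)      # visits every index, in no particular order
        print word, count[word]
}' | sort)
echo "$result"
```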

@cindex arrays, elements, order of
@cindex elements in arrays, order of
The order in which elements of the array are accessed by this statement
is determined by the internal arrangement of the array elements within
@command{awk} and cannot be controlled or changed. This can lead to
problems if new elements are added to @var{array} by statements in
the loop body; it is not predictable whether the @code{for} loop will
reach them. Similarly, changing @var{var} inside the loop may produce
strange results. It is best to avoid such things.

@node Delete
@section The @code{delete} Statement
@cindex @code{delete} statement
@cindex deleting elements in arrays
@cindex arrays, elements, deleting
@cindex elements in arrays, deleting

To remove an individual element of an array, use the @code{delete}
statement:

@example
delete @var{array}[@var{index}]
@end example

Once an array element has been deleted, any value the element once
had is no longer available. It is as if the element had never
been referred to or been given a value.
The following is an example of deleting elements in an array:

@example
for (i in frequencies)
    delete frequencies[i]
@end example

@noindent
This example removes all the elements from the array @code{frequencies}.
Once an element is deleted, a subsequent @code{for} statement to scan the array
does not report that element and the @code{in} operator to check for
the presence of that element returns zero (i.e., false):

@example
delete foo[4]
if (4 in foo)
    print "This will never be printed"
@end example

@cindex null strings, array elements and
It is important to note that deleting an element is @emph{not} the
same as assigning it a null value (the empty string, @code{""}).
11667For example: 11668 11669@example 11670foo[4] = "" 11671if (4 in foo) 11672 print "This is printed, even though foo[4] is empty" 11673@end example 11674 11675@cindex lint checking, array elements 11676It is not an error to delete an element that does not exist. 11677If @option{--lint} is provided on the command line 11678(@pxref{Options}), 11679@command{gawk} issues a warning message when an element that 11680is not in the array is deleted. 11681 11682@cindex arrays, deleting entire contents 11683@cindex deleting entire arrays 11684@cindex differences in @command{awk} and @command{gawk}, array elements, deleting 11685All the elements of an array may be deleted with a single statement 11686by leaving off the subscript in the @code{delete} statement, 11687as follows: 11688 11689@example 11690delete @var{array} 11691@end example 11692 11693This ability is a @command{gawk} extension; it is not available in 11694compatibility mode (@pxref{Options}). 11695 11696Using this version of the @code{delete} statement is about three times 11697more efficient than the equivalent loop that deletes each element one 11698at a time. 11699 11700@cindex portability, deleting array elements 11701@cindex Brennan, Michael 11702The following statement provides a portable but nonobvious way to clear 11703out an array:@footnote{Thanks to Michael Brennan for pointing this out.} 11704 11705@example 11706split("", array) 11707@end example 11708 11709@c comma before deleting does NOT start a tertiary 11710@cindex @code{split} function, array elements, deleting 11711The @code{split} function 11712(@pxref{String Functions}) 11713clears out the target array first. This call asks it to split 11714apart the null string. Because there is no data to split out, the 11715function simply clears the array and then returns. 11716 11717@strong{Caution:} Deleting an array does not change its type; you cannot 11718delete an array and then use the array's name as a scalar 11719(i.e., a regular variable). 
For example, the following does not work:

@example
a[1] = 3; delete a; a = 3
@end example

@node Numeric Array Subscripts
@section Using Numbers to Subscript Arrays

@cindex numbers, as array subscripts
@cindex arrays, subscripts
@cindex subscripts in arrays, numbers as
@cindex @code{CONVFMT} variable, array subscripts and
An important aspect of arrays to remember is that @emph{array subscripts
are always strings}. When a numeric value is used as a subscript,
it is converted to a string value before being used for subscripting
(@pxref{Conversion}).
This means that the value of the built-in variable @code{CONVFMT} can
affect how your program accesses elements of an array. For example:

@example
xyz = 12.153
data[xyz] = 1
CONVFMT = "%2.2f"
if (xyz in data)
    printf "%s is in data\n", xyz
else
    printf "%s is not in data\n", xyz
@end example

@noindent
This prints @samp{12.15 is not in data}. The first statement gives
@code{xyz} a numeric value. Assigning to
@code{data[xyz]} subscripts @code{data} with the string value @code{"12.153"}
(using the default conversion value of @code{CONVFMT}, @code{"%.6g"}).
Thus, the array element @code{data["12.153"]} is assigned the value one.
The program then changes
the value of @code{CONVFMT}. The test @samp{(xyz in data)} generates a new
string value from @code{xyz}---this time @code{"12.15"}---because the value of
@code{CONVFMT} only allows two digits after the decimal point. This test fails,
since @code{"12.15"} is a different string from @code{"12.153"}.

@cindex converting, during subscripting
According to the rules for conversions
(@pxref{Conversion}), integer
values are always converted to strings as integers, no matter what the
value of @code{CONVFMT} may happen to be.
So the usual case of 11766the following works: 11767 11768@example 11769for (i = 1; i <= maxsub; i++) 11770 @i{do something with} array[i] 11771@end example 11772 11773The ``integer values always convert to strings as integers'' rule 11774has an additional consequence for array indexing. 11775Octal and hexadecimal constants 11776(@pxref{Nondecimal-numbers}) 11777are converted internally into numbers, and their original form 11778is forgotten. 11779This means, for example, that 11780@code{array[17]}, 11781@code{array[021]}, 11782and 11783@code{array[0x11]} 11784all refer to the same element! 11785 11786As with many things in @command{awk}, the majority of the time 11787things work as one would expect them to. But it is useful to have a precise 11788knowledge of the actual rules which sometimes can have a subtle 11789effect on your programs. 11790 11791@node Uninitialized Subscripts 11792@section Using Uninitialized Variables as Subscripts 11793 11794@c last comma does NOT start a tertiary 11795@cindex variables, uninitialized, as array subscripts 11796@cindex uninitialized variables, as array subscripts 11797@cindex subscripts in arrays, uninitialized variables as 11798@cindex arrays, subscripts, uninitialized variables as 11799Suppose it's necessary to write a program 11800to print the input data in reverse order. 11801A reasonable attempt to do so (with some test 11802data) might look like this: 11803 11804@example 11805$ echo 'line 1 11806> line 2 11807> line 3' | awk '@{ l[lines] = $0; ++lines @} 11808> END @{ 11809> for (i = lines-1; i >= 0; --i) 11810> print l[i] 11811> @}' 11812@print{} line 3 11813@print{} line 2 11814@end example 11815 11816Unfortunately, the very first line of input data did not come out in the 11817output! 11818 11819At first glance, this program should have worked. The variable @code{lines} 11820is uninitialized, and uninitialized variables have the numeric value zero. 11821So, @command{awk} should have printed the value of @code{l[0]}. 
11822 11823The issue here is that subscripts for @command{awk} arrays are @emph{always} 11824strings. Uninitialized variables, when used as strings, have the 11825value @code{""}, not zero. Thus, @samp{line 1} ends up stored in 11826@code{l[""]}. 11827The following version of the program works correctly: 11828 11829@example 11830@{ l[lines++] = $0 @} 11831END @{ 11832 for (i = lines - 1; i >= 0; --i) 11833 print l[i] 11834@} 11835@end example 11836 11837Here, the @samp{++} forces @code{lines} to be numeric, thus making 11838the ``old value'' numeric zero. This is then converted to @code{"0"} 11839as the array subscript. 11840 11841@cindex null strings, as array subscripts 11842@cindex dark corner, array subscripts 11843@cindex lint checking, array subscripts 11844Even though it is somewhat unusual, the null string 11845(@code{""}) is a valid array subscript. 11846@value{DARKCORNER} 11847@command{gawk} warns about the use of the null string as a subscript 11848if @option{--lint} is provided 11849on the command line (@pxref{Options}). 11850 11851@node Multi-dimensional 11852@section Multidimensional Arrays 11853 11854@cindex subscripts in arrays, multidimensional 11855@cindex arrays, multidimensional 11856A multidimensional array is an array in which an element is identified 11857by a sequence of indices instead of a single index. For example, a 11858two-dimensional array requires two indices. The usual way (in most 11859languages, including @command{awk}) to refer to an element of a 11860two-dimensional array named @code{grid} is with 11861@code{grid[@var{x},@var{y}]}. 11862 11863@cindex @code{SUBSEP} variable, multidimensional arrays 11864Multidimensional arrays are supported in @command{awk} through 11865concatenation of indices into one string. 11866@command{awk} converts the indices into strings 11867(@pxref{Conversion}) and 11868concatenates them together, with a separator between them. 
This creates 11869a single string that describes the values of the separate indices. The 11870combined string is used as a single index into an ordinary, 11871one-dimensional array. The separator used is the value of the built-in 11872variable @code{SUBSEP}. 11873 11874For example, suppose we evaluate the expression @samp{foo[5,12] = "value"} 11875when the value of @code{SUBSEP} is @code{"@@"}. The numbers 5 and 12 are 11876converted to strings and 11877concatenated with an @samp{@@} between them, yielding @code{"5@@12"}; thus, 11878the array element @code{foo["5@@12"]} is set to @code{"value"}. 11879 11880Once the element's value is stored, @command{awk} has no record of whether 11881it was stored with a single index or a sequence of indices. The two 11882expressions @samp{foo[5,12]} and @w{@samp{foo[5 SUBSEP 12]}} are always 11883equivalent. 11884 11885The default value of @code{SUBSEP} is the string @code{"\034"}, 11886which contains a nonprinting character that is unlikely to appear in an 11887@command{awk} program or in most input data. 11888The usefulness of choosing an unlikely character comes from the fact 11889that index values that contain a string matching @code{SUBSEP} can lead to 11890combined strings that are ambiguous. Suppose that @code{SUBSEP} is 11891@code{"@@"}; then @w{@samp{foo["a@@b", "c"]}} and @w{@samp{foo["a", 11892"b@@c"]}} are indistinguishable because both are actually 11893stored as @samp{foo["a@@b@@c"]}. 11894 11895To test whether a particular index sequence exists in a 11896multidimensional array, use the same operator (@samp{in}) that is 11897used for single dimensional arrays. 
Write the whole sequence of indices 11898in parentheses, separated by commas, as the left operand: 11899 11900@example 11901(@var{subscript1}, @var{subscript2}, @dots{}) in @var{array} 11902@end example 11903 11904The following example treats its input as a two-dimensional array of 11905fields; it rotates this array 90 degrees clockwise and prints the 11906result. It assumes that all lines have the same number of 11907elements: 11908 11909@example 11910@{ 11911 if (max_nf < NF) 11912 max_nf = NF 11913 max_nr = NR 11914 for (x = 1; x <= NF; x++) 11915 vector[x, NR] = $x 11916@} 11917 11918END @{ 11919 for (x = 1; x <= max_nf; x++) @{ 11920 for (y = max_nr; y >= 1; --y) 11921 printf("%s ", vector[x, y]) 11922 printf("\n") 11923 @} 11924@} 11925@end example 11926 11927@noindent 11928When given the input: 11929 11930@example 119311 2 3 4 5 6 119322 3 4 5 6 1 119333 4 5 6 1 2 119344 5 6 1 2 3 11935@end example 11936 11937@noindent 11938the program produces the following output: 11939 11940@example 119414 3 2 1 119425 4 3 2 119436 5 4 3 119441 6 5 4 119452 1 6 5 119463 2 1 6 11947@end example 11948 11949@node Multi-scanning 11950@section Scanning Multidimensional Arrays 11951 11952There is no special @code{for} statement for scanning a 11953``multidimensional'' array. There cannot be one, because, in truth, there 11954are no multidimensional arrays or elements---there is only a 11955multidimensional @emph{way of accessing} an array. 11956 11957@cindex subscripts in arrays, multidimensional, scanning 11958@cindex arrays, multidimensional, scanning 11959However, if your program has an array that is always accessed as 11960multidimensional, you can get the effect of scanning it by combining 11961the scanning @code{for} statement 11962(@pxref{Scanning an Array}) with the 11963built-in @code{split} function 11964(@pxref{String Functions}). 
It works in the following manner:

@example
for (combined in array) @{
    split(combined, separate, SUBSEP)
    @dots{}
@}
@end example

@noindent
This sets the variable @code{combined} to
each concatenated combined index in the array, and splits it
into the individual indices by breaking it apart where the value of
@code{SUBSEP} appears. The individual indices then become the elements of
the array @code{separate}.

Thus, if a value is previously stored in @code{array[1, "foo"]}, then
an element with index @code{"1\034foo"} exists in @code{array}. (Recall
that the default value of @code{SUBSEP} is the character with octal code 034.)
Sooner or later, the @code{for} statement finds that index and does an
iteration with the variable @code{combined} set to @code{"1\034foo"}.
Then the @code{split} function is called as follows:

@example
split("1\034foo", separate, "\034")
@end example

@noindent
The result is to set @code{separate[1]} to @code{"1"} and
@code{separate[2]} to @code{"foo"}. Presto! The original sequence of
separate indices is recovered.

@node Array Sorting
@section Sorting Array Values and Indices with @command{gawk}

@cindex arrays, sorting
@cindex @code{asort} function (@command{gawk})
@c last comma does NOT start a tertiary
@cindex @code{asort} function (@command{gawk}), arrays, sorting
@cindex sort function, arrays, sorting
The order in which an array is scanned with a @samp{for (i in array)}
loop is essentially arbitrary.
In most @command{awk} implementations, sorting an array requires
writing a @code{sort} function.
While this can be educational for exploring different sorting algorithms,
usually that's not the point of the program.
12011@command{gawk} provides the built-in @code{asort} 12012and @code{asorti} functions 12013(@pxref{String Functions}) 12014for sorting arrays. For example: 12015 12016@example 12017@var{populate the array} data 12018n = asort(data) 12019for (i = 1; i <= n; i++) 12020 @var{do something with} data[i] 12021@end example 12022 12023After the call to @code{asort}, the array @code{data} is indexed from 1 12024to some number @var{n}, the total number of elements in @code{data}. 12025(This count is @code{asort}'s return value.) 12026@code{data[1]} @value{LEQ} @code{data[2]} @value{LEQ} @code{data[3]}, and so on. 12027The comparison of array elements is done 12028using @command{gawk}'s usual comparison rules 12029(@pxref{Typing and Comparison}). 12030 12031@cindex side effects, @code{asort} function 12032An important side effect of calling @code{asort} is that 12033@emph{the array's original indices are irrevocably lost}. 12034As this isn't always desirable, @code{asort} accepts a 12035second argument: 12036 12037@example 12038@var{populate the array} source 12039n = asort(source, dest) 12040for (i = 1; i <= n; i++) 12041 @var{do something with} dest[i] 12042@end example 12043 12044In this case, @command{gawk} copies the @code{source} array into the 12045@code{dest} array and then sorts @code{dest}, destroying its indices. 12046However, the @code{source} array is not affected. 12047 12048Often, what's needed is to sort on the values of the @emph{indices} 12049instead of the values of the elements. 12050To do that, starting with @command{gawk} 3.1.2, use the 12051@code{asorti} function. 
The interface is identical to that of 12052@code{asort}, except that the index values are used for sorting, and 12053become the values of the result array: 12054 12055@example 12056@{ source[$0] = some_func($0) @} 12057 12058END @{ 12059 n = asorti(source, dest) 12060 for (i = 1; i <= n; i++) 12061 @var{do something with} dest[i] 12062@} 12063@end example 12064 12065If your version of @command{gawk} is 3.1.0 or 3.1.1, you don't 12066have @code{asorti}. Instead, use a helper array 12067to hold the sorted index values, and then access the original array's 12068elements. It works in the following way: 12069 12070@example 12071@var{populate the array} data 12072# copy indices 12073j = 1 12074for (i in data) @{ 12075 ind[j] = i # index value becomes element value 12076 j++ 12077@} 12078n = asort(ind) # index values are now sorted 12079for (i = 1; i <= n; i++) 12080 @var{do something with} data[ind[i]] 12081@end example 12082 12083Sorting the array by replacing the indices provides maximal flexibility. 12084To traverse the elements in decreasing order, use a loop that goes from 12085@var{n} down to 1, either over the elements or over the indices. 12086 12087@cindex reference counting, sorting arrays 12088Copying array indices and elements isn't expensive in terms of memory. 12089Internally, @command{gawk} maintains @dfn{reference counts} to data. 12090For example, when @code{asort} copies the first array to the second one, 12091there is only one copy of the original array elements' data, even though 12092both arrays use the values. Similarly, when copying the indices from 12093@code{data} to @code{ind}, there is only one copy of the actual index 12094strings. 12095 12096@c Document It And Call It A Feature. Sigh. 
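
For @command{awk} implementations that lack @code{asort}, a hand-written
sort is needed, as noted at the start of this @value{SECTION}. The
following is one possible sketch and is not part of @command{gawk}:
@code{isort} is a hypothetical user-defined function that performs an
insertion sort on numeric values stored at indices 1 through @var{n}:

```shell
result=$(awk '
# isort is a hypothetical helper, not a built-in function.
# It sorts the numeric values arr[1..n] in place, in increasing order.
function isort(arr, n,    i, j, tmp) {
    for (i = 2; i <= n; i++) {
        tmp = arr[i]
        # Shift larger elements one slot to the right.
        for (j = i - 1; j >= 1 && arr[j] + 0 > tmp + 0; j--)
            arr[j + 1] = arr[j]
        arr[j + 1] = tmp
    }
}
BEGIN {
    n = split("30 8 15 4", data)
    isort(data, n)
    for (i = 1; i <= n; i++)
        printf "%s%s", data[i], (i < n ? " " : "\n")
}')
echo "$result"
```

Adding zero to both operands forces a numeric comparison; a version meant
for strings would compare the elements directly instead.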
12097@cindex arrays, sorting, @code{IGNORECASE} variable and 12098@cindex @code{IGNORECASE} variable, array sorting and 12099We said previously that comparisons are done using @command{gawk}'s 12100``usual comparison rules.'' Because @code{IGNORECASE} affects 12101string comparisons, the value of @code{IGNORECASE} also 12102affects sorting for both @code{asort} and @code{asorti}. 12103Caveat Emptor. 12104@c ENDOFRANGE arrs 12105 12106@node Functions 12107@chapter Functions 12108 12109@c STARTOFRANGE funcbi 12110@cindex functions, built-in 12111@c STARTOFRANGE bifunc 12112@cindex built-in functions 12113This @value{CHAPTER} describes @command{awk}'s built-in functions, 12114which fall into three categories: numeric, string, and I/O. 12115@command{gawk} provides additional groups of functions 12116to work with values that represent time, do 12117bit manipulation, and internationalize and localize programs. 12118 12119Besides the built-in functions, @command{awk} has provisions for 12120writing new functions that the rest of a program can use. 12121The second half of this @value{CHAPTER} describes these 12122@dfn{user-defined} functions. 12123 12124@menu 12125* Built-in:: Summarizes the built-in functions. 12126* User-defined:: Describes User-defined functions in detail. 12127@end menu 12128 12129@node Built-in 12130@section Built-in Functions 12131 12132@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!! 12133@dfn{Built-in} functions are always available for 12134your @command{awk} program to call. This @value{SECTION} defines all 12135the built-in 12136functions in @command{awk}; some of these are mentioned in other sections 12137but are summarized here for your convenience. 12138 12139@menu 12140* Calling Built-in:: How to call built-in functions. 12141* Numeric Functions:: Functions that work with numbers, including 12142 @code{int}, @code{sin} and @code{rand}. 
12143* String Functions:: Functions for string manipulation, such as 12144 @code{split}, @code{match} and @code{sprintf}. 12145* I/O Functions:: Functions for files and shell commands. 12146* Time Functions:: Functions for dealing with timestamps. 12147* Bitwise Functions:: Functions for bitwise operations. 12148* I18N Functions:: Functions for string translation. 12149@end menu 12150 12151@node Calling Built-in 12152@subsection Calling Built-in Functions 12153 12154To call one of @command{awk}'s built-in functions, write the name of 12155the function followed 12156by arguments in parentheses. For example, @samp{atan2(y + z, 1)} 12157is a call to the function @code{atan2} and has two arguments. 12158 12159@cindex programming conventions, functions, calling 12160@c last comma does NOT start a tertiary 12161@cindex whitespace, functions, calling 12162Whitespace is ignored between the built-in function name and the 12163open parenthesis, and it is good practice to avoid using whitespace 12164there. User-defined functions do not permit whitespace in this way, and 12165it is easier to avoid mistakes by following a simple 12166convention that always works---no whitespace after a function name. 12167 12168@c last comma is part of tertiary 12169@cindex troubleshooting, @command{gawk}, fatal errors, function arguments 12170@cindex @command{gawk}, function arguments and 12171@cindex differences in @command{awk} and @command{gawk}, function arguments (@command{gawk}) 12172Each built-in function accepts a certain number of arguments. 12173In some cases, arguments can be omitted. The defaults for omitted 12174arguments vary from function to function and are described under the 12175individual functions. In some @command{awk} implementations, extra 12176arguments given to built-in functions are ignored. However, in @command{gawk}, 12177it is a fatal error to give extra arguments to a built-in function. 
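
As a brief illustration of omitted arguments (a sketch only; the input
string is made up), @code{substr} called without its third argument
returns the rest of the string, and @code{length} called without an
argument uses the whole record:

```shell
result=$(echo 'hello world' | awk '{
    print substr($0, 7)    # third argument omitted: rest of the string
    print length()         # argument omitted: length of $0
}')
echo "$result"
```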

When a function is called, expressions that create the function's actual
parameters are evaluated completely before the call is performed.
For example, in the following code fragment:

@example
i = 4
j = sqrt(i++)
@end example

@cindex evaluation order, functions
@cindex functions, built-in, evaluation order
@cindex built-in functions, evaluation order
@noindent
the variable @code{i} is incremented to the value five before @code{sqrt}
is called with a value of four for its actual parameter.
The order of evaluation of the expressions used for the function's
parameters is undefined. Thus, avoid writing programs that
assume that parameters are evaluated from left to right or from
right to left. For example:

@example
i = 5
j = atan2(i++, i *= 2)
@end example

If the order of evaluation is left to right, then @code{i} first becomes
6, and then 12, and @code{atan2} is called with the two arguments 5
and 12, because @samp{i++} yields the value @code{i} had before the
increment. But if the order of evaluation is right to left, @code{i}
first becomes 10, then 11, and @code{atan2} is called with the
two arguments 10 and 10.

@node Numeric Functions
@subsection Numeric Functions

The following list describes all of
the built-in functions that work with numbers.
Optional parameters are enclosed in square brackets@w{ ([ ]):}

@table @code
@item int(@var{x})
@cindex @code{int} function
This returns the nearest integer to @var{x}, located between @var{x} and zero and
truncated toward zero.

For example, @code{int(3)} is 3, @code{int(3.9)} is 3, @code{int(-3.9)}
is @minus{}3, and @code{int(-3)} is @minus{}3 as well.

@item sqrt(@var{x})
@cindex @code{sqrt} function
This returns the positive square root of @var{x}.
@command{gawk} reports an error
if @var{x} is negative.
Thus, @code{sqrt(4)} is 2.

@item exp(@var{x})
@cindex @code{exp} function
This returns the exponential of @var{x} (@code{e ^ @var{x}}) or reports
an error if @var{x} is out of range.  The range of values @var{x} can have
depends on your machine's floating-point representation.

@item log(@var{x})
@cindex @code{log} function
This returns the natural logarithm of @var{x}, if @var{x} is positive;
otherwise, it reports an error.

@item sin(@var{x})
@cindex @code{sin} function
This returns the sine of @var{x}, with @var{x} in radians.

@item cos(@var{x})
@cindex @code{cos} function
This returns the cosine of @var{x}, with @var{x} in radians.

@item atan2(@var{y}, @var{x})
@cindex @code{atan2} function
This returns the arctangent of @code{@var{y} / @var{x}} in radians.

@item rand()
@cindex @code{rand} function
@cindex random numbers, @code{rand}/@code{srand} functions
This returns a random number.  The values of @code{rand} are
uniformly distributed between zero and one.
The value could be zero but is never one.@footnote{The C version of @code{rand}
is known to produce fairly poor sequences of random numbers.
However, nothing requires that an @command{awk} implementation use the C
@code{rand} to implement the @command{awk} version of @code{rand}.
In fact, @command{gawk} uses the BSD @code{random} function, which is
considerably better than @code{rand}, to produce random numbers.}

Often random integers are needed instead.  Following is a user-defined function
that can be used to obtain a random non-negative integer less than @var{n}:

@example
function randint(n) @{
     return int(n * rand())
@}
@end example

@noindent
The multiplication produces a random number greater than or equal to zero
and less than @code{n}.
Using @code{int}, this result is made into
an integer between zero and @code{n} @minus{} 1, inclusive.

The following example uses a similar function to produce random integers
between one and @var{n}.  This program prints a new random number for
each input record:

@example
# Function to roll a simulated die.
function roll(n) @{ return 1 + int(rand() * n) @}

# Roll 3 six-sided dice and
# print total number of points.
@{
      printf("%d points\n",
             roll(6)+roll(6)+roll(6))
@}
@end example

@cindex numbers, random
@cindex random numbers, seed of
@c MAWK uses a different seed each time.
@strong{Caution:} In most @command{awk} implementations, including @command{gawk},
@code{rand} starts generating numbers from the same
starting number, or @dfn{seed}, each time you run @command{awk}.  Thus,
a program generates the same results each time you run it.
The numbers are random within one @command{awk} run but predictable
from run to run.  This is convenient for debugging, but if you want
a program to do different things each time it is used, you must change
the seed to a value that is different in each run.  To do this,
use @code{srand}.

@item srand(@r{[}@var{x}@r{]})
@cindex @code{srand} function
The function @code{srand} sets the starting point, or seed,
for generating random numbers to the value @var{x}.

Each seed value leads to a particular sequence of random
numbers.@footnote{Computer-generated random numbers really are not truly
random.  They are technically known as ``pseudorandom.''  This means
that while the numbers in a sequence appear to be random, you can in
fact generate the same sequence of random numbers over and over again.}
Thus, if the seed is set to the same value a second time,
the same sequence of random numbers is produced again.
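
The following sketch illustrates this reseeding behavior; the seed value 42
and the variable names are arbitrary choices made for the example:

@example
BEGIN @{
    srand(42); a = rand(); b = rand()
    srand(42); c = rand(); d = rand()
    if (a == c && b == d)
        print "same sequence"
@}
@end example

@noindent
Because both pairs of calls start from the same seed, this prints
@samp{same sequence}.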

Different @command{awk} implementations use different random-number
generators internally.  Don't expect the same @command{awk} program
to produce the same series of random numbers when executed by
different versions of @command{awk}.

If the argument @var{x} is omitted, as in @samp{srand()}, then the current
date and time of day are used for a seed.  This is the way to get random
numbers that are truly unpredictable.

The return value of @code{srand} is the previous seed.  This makes it
easy to keep track of the seeds in case you need to consistently reproduce
sequences of random numbers.
@end table

@node String Functions
@subsection String-Manipulation Functions

The functions in this @value{SECTION} look at or change the text of one or more
strings.
Optional parameters are enclosed in square brackets@w{ ([ ]).}
Those functions that are
specific to @command{gawk} are marked with a pound sign@w{ (@samp{#}):}

@menu
* Gory Details::                More than you want to know about @samp{\} and
                                @samp{&} with @code{sub}, @code{gsub}, and
                                @code{gensub}.
@end menu

@table @code
@item asort(@var{source} @r{[}, @var{dest}@r{]}) #
@cindex arrays, elements, retrieving number of
@cindex @code{asort} function (@command{gawk})
@code{asort} is a @command{gawk}-specific extension, returning the number of
elements in the array @var{source}.  The contents of @var{source} are
sorted using @command{gawk}'s normal rules for comparing values
(in particular, @code{IGNORECASE} affects the sorting)
and the indices
of the sorted values of @var{source} are replaced with sequential
integers starting with one.  If the optional array @var{dest} is specified,
then @var{source} is duplicated into @var{dest}.  @var{dest} is then
sorted, leaving the indices of @var{source} unchanged.
For example, if the contents of @code{a} are as follows:

@example
a["last"] = "de"
a["first"] = "sac"
a["middle"] = "cul"
@end example

@noindent
A call to @code{asort}:

@example
asort(a)
@end example

@noindent
results in the following contents of @code{a}:

@example
a[1] = "cul"
a[2] = "de"
a[3] = "sac"
@end example

The @code{asort} function is described in more detail in
@ref{Array Sorting}.
@code{asort} is a @command{gawk} extension; it is not available
in compatibility mode (@pxref{Options}).

@item asorti(@var{source} @r{[}, @var{dest}@r{]}) #
@cindex @code{asorti} function (@command{gawk})
@code{asorti} is a @command{gawk}-specific extension, returning the number of
elements in the array @var{source}.
It works similarly to @code{asort}, however, the @emph{indices}
are sorted, instead of the values.  As array indices are always strings,
the comparison performed is always a string comparison.  (Here too,
@code{IGNORECASE} affects the sorting.)

The @code{asorti} function is described in more detail in
@ref{Array Sorting}.
It was added in @command{gawk} 3.1.2.
@code{asorti} is a @command{gawk} extension; it is not available
in compatibility mode (@pxref{Options}).

@item index(@var{in}, @var{find})
@cindex @code{index} function
@cindex searching
This searches the string @var{in} for the first occurrence of the string
@var{find}, and returns the position in characters where that occurrence
begins in the string @var{in}.  Consider the following example:

@example
$ awk 'BEGIN @{ print index("peanut", "an") @}'
@print{} 3
@end example

@noindent
If @var{find} is not found, @code{index} returns zero.
(Remember that string indices in @command{awk} start at one.)
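
As a small additional illustration (the search strings are arbitrary
choices), the first call below finds a match and the second does not:

@example
$ awk 'BEGIN @{ print index("peanut", "nut"), index("peanut", "xyz") @}'
@print{} 4 0
@end example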

@item length(@r{[}@var{string}@r{]})
@cindex @code{length} function
This returns the number of characters in @var{string}.  If
@var{string} is a number, the length of the digit string representing
that number is returned.  For example, @code{length("abcde")} is 5.  By
contrast, @code{length(15 * 35)} works out to 3.  In this example, 15 * 35 =
525, and 525 is then converted to the string @code{"525"}, which has
three characters.

If no argument is supplied, @code{length} returns the length of @code{$0}.

@c @cindex historical features
@cindex portability, @code{length} function
@cindex POSIX @command{awk}, functions and, @code{length}
@strong{Note:}
In older versions of @command{awk}, the @code{length} function could
be called
without any parentheses.  Doing so is marked as ``deprecated'' in the
POSIX standard.  This means that while a program can do this,
it is a feature that can eventually be removed from a future
version of the standard.  Therefore, for programs to be maximally portable,
always supply the parentheses.

@item match(@var{string}, @var{regexp} @r{[}, @var{array}@r{]})
@cindex @code{match} function
The @code{match} function searches @var{string} for the
longest, leftmost substring matched by the regular expression,
@var{regexp}.  It returns the character position, or @dfn{index},
at which that substring begins (one, if it starts at the beginning of
@var{string}).  If no match is found, it returns zero.

The @var{regexp} argument may be either a regexp constant
(@samp{/@dots{}/}) or a string constant (@var{"@dots{}"}).
In the latter case, the string is treated as a regexp to be matched.
@xref{Computed Regexps}, for a
discussion of the difference between the two forms, and the
implications for writing your program correctly.
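
For a simple pattern that contains no backslashes, the two forms behave
identically, as this short sketch (with an arbitrary sample string) suggests:

@example
$ awk 'BEGIN @{ print match("foobar", /o+/), match("foobar", "o+") @}'
@print{} 2 2
@end example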

The order of the first two arguments is backwards from most other string
functions that work with regular expressions, such as
@code{sub} and @code{gsub}.  It might help to remember that
for @code{match}, the order is the same as for the @samp{~} operator:
@samp{@var{string} ~ @var{regexp}}.

@cindex @code{RSTART} variable, @code{match} function and
@cindex @code{RLENGTH} variable, @code{match} function and
@cindex @code{match} function, @code{RSTART}/@code{RLENGTH} variables
The @code{match} function sets the built-in variable @code{RSTART} to
the index.  It also sets the built-in variable @code{RLENGTH} to the
length in characters of the matched substring.  If no match is found,
@code{RSTART} is set to zero, and @code{RLENGTH} to @minus{}1.

For example:

@example
@c file eg/misc/findpat.awk
@{
       if ($1 == "FIND")
         regex = $2
       else @{
         where = match($0, regex)
         if (where != 0)
           print "Match of", regex, "found at",
                     where, "in", $0
       @}
@}
@c endfile
@end example

@noindent
This program looks for lines that match the regular expression stored in
the variable @code{regex}.  This regular expression can be changed.  If the
first word on a line is @samp{FIND}, @code{regex} is changed to be the
second word on that line.  Therefore, if given:

@example
@c file eg/misc/findpat.data
FIND ru+n
My program runs
but not very quickly
FIND Melvin
JF+KM
This line is property of Reality Engineering Co.
Melvin was here.
@c endfile
@end example

@noindent
@command{awk} prints:

@example
Match of ru+n found at 12 in My program runs
Match of Melvin found at 1 in Melvin was here.
@end example

@cindex differences in @command{awk} and @command{gawk}, @code{match} function
If @var{array} is present, it is cleared, and then the 0th element
of @var{array} is set to the entire portion of @var{string}
matched by @var{regexp}.  If @var{regexp} contains parentheses,
the integer-indexed elements of @var{array} are set to contain the
portion of @var{string} matching the corresponding parenthesized
subexpression.
For example:

@example
$ echo foooobazbarrrrr |
> gawk '@{ match($0, /(fo+).+(bar*)/, arr)
>           print arr[1], arr[2] @}'
@print{} foooo barrrrr
@end example

In addition,
beginning with @command{gawk} 3.1.2,
multidimensional subscripts are available providing
the start index and length of each matched subexpression:

@example
$ echo foooobazbarrrrr |
> gawk '@{ match($0, /(fo+).+(bar*)/, arr)
>           print arr[1], arr[2]
>           print arr[1, "start"], arr[1, "length"]
>           print arr[2, "start"], arr[2, "length"]
> @}'
@print{} foooo barrrrr
@print{} 1 5
@print{} 9 7
@end example

There may not be subscripts for the start and length of every parenthesized
subexpression, since they may not all have matched text; thus they
should be tested for with the @code{in} operator
(@pxref{Reference to Elements}).

@cindex troubleshooting, @code{match} function
The @var{array} argument to @code{match} is a
@command{gawk} extension.  In compatibility mode
(@pxref{Options}),
using a third argument is a fatal error.

@item split(@var{string}, @var{array} @r{[}, @var{fieldsep}@r{]})
@cindex @code{split} function
This function divides @var{string} into pieces separated by @var{fieldsep}
and stores the pieces in @var{array}.
The first piece is stored in
@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so
forth.  The string value of the third argument, @var{fieldsep}, is
a regexp describing where to split @var{string} (much as @code{FS} can
be a regexp describing where to split input records).  If
@var{fieldsep} is omitted, the value of @code{FS} is used.
@code{split} returns the number of elements created.

The @code{split} function splits strings into pieces in a
manner similar to the way input lines are split into fields.  For example:

@example
split("cul-de-sac", a, "-")
@end example

@noindent
@cindex strings, splitting
splits the string @samp{cul-de-sac} into three fields using @samp{-} as the
separator.  It sets the contents of the array @code{a} as follows:

@example
a[1] = "cul"
a[2] = "de"
a[3] = "sac"
@end example

@noindent
The value returned by this call to @code{split} is three.

@cindex differences in @command{awk} and @command{gawk}, @code{split} function
As with input field-splitting, when the value of @var{fieldsep} is
@w{@code{" "}}, leading and trailing whitespace is ignored, and the elements
are separated by runs of whitespace.
Also as with input field-splitting, if @var{fieldsep} is the null string, each
individual character in the string is split into its own array element.
(This is a @command{gawk}-specific extension.)

Note, however, that @code{RS} has no effect on the way @code{split}
works.  Even though @samp{RS = ""} causes newline to also be an input
field separator, this does not affect how @code{split} splits strings.

@cindex dark corner, @code{split} function
Modern implementations of @command{awk}, including @command{gawk}, allow
the third argument to be a regexp constant (@code{/abc/}) as well as a
string.
@value{DARKCORNER}
The POSIX standard allows this as well.
@xref{Computed Regexps}, for a
discussion of the difference between using a string constant or a regexp constant,
and the implications for writing your program correctly.

Before splitting the string, @code{split} deletes any previously existing
elements in the array @var{array}.

If @var{string} is null, the array has no elements.  (So this is a portable
way to delete an entire array with one statement.
@xref{Delete}.)

If @var{string} does not match @var{fieldsep} at all (but is not null),
@var{array} has one element only.  The value of that element is the original
@var{string}.

@item sprintf(@var{format}, @var{expression1}, @dots{})
@cindex @code{sprintf} function
This returns (without printing) the string that @code{printf} would
have printed out with the same arguments
(@pxref{Printf}).
For example:

@example
pival = sprintf("pi = %.2f (approx.)", 22/7)
@end example

@noindent
assigns the string @w{@code{"pi = 3.14 (approx.)"}} to the variable @code{pival}.

@cindex differences in @command{awk} and @command{gawk}, @code{strtonum} function (@command{gawk})
@cindex @code{strtonum} function (@command{gawk})
@item strtonum(@var{str}) #
Examines @var{str} and returns its numeric value.  If @var{str}
begins with a leading @samp{0}, @code{strtonum} assumes that @var{str}
is an octal number.  If @var{str} begins with a leading @samp{0x} or
@samp{0X}, @code{strtonum} assumes that @var{str} is a hexadecimal number.
For example:

@example
$ echo 0x11 |
> gawk '@{ printf "%d\n", strtonum($1) @}'
@print{} 17
@end example

Using the @code{strtonum} function is @emph{not} the same as adding zero
to a string value; the automatic coercion of strings to numbers
works only for decimal data, not for octal or hexadecimal.@footnote{Unless
you use the @option{--non-decimal-data} option, which isn't recommended.
@xref{Nondecimal Data}, for more information.}

@cindex differences in @command{awk} and @command{gawk}, @code{strtonum} function (@command{gawk})
@code{strtonum} is a @command{gawk} extension; it is not available
in compatibility mode (@pxref{Options}).

@item sub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
@cindex @code{sub} function
The @code{sub} function alters the value of @var{target}.
It searches this value, which is treated as a string, for the
leftmost, longest substring matched by the regular expression @var{regexp}.
Then the entire string is
changed by replacing the matched text with @var{replacement}.
The modified string becomes the new value of @var{target}.

The @var{regexp} argument may be either a regexp constant
(@samp{/@dots{}/}) or a string constant (@var{"@dots{}"}).
In the latter case, the string is treated as a regexp to be matched.
@xref{Computed Regexps}, for a
discussion of the difference between the two forms, and the
implications for writing your program correctly.

This function is peculiar because @var{target} is not simply
used to compute a value, and not just any expression will do---it
must be a variable, field, or array element so that @code{sub} can
store a modified value there.
If this argument is omitted, then the
default is to use and alter @code{$0}.@footnote{Note that this means
that the record will first be regenerated using the value of @code{OFS} if
any fields have been changed, and that the fields will be updated
after the substitution, even if the operation is a ``no-op'' such
as @samp{sub(/^/, "")}.}
For example:

@example
str = "water, water, everywhere"
sub(/at/, "ith", str)
@end example

@noindent
sets @code{str} to @w{@code{"wither, water, everywhere"}}, by replacing the
leftmost longest occurrence of @samp{at} with @samp{ith}.

The @code{sub} function returns the number of substitutions made (either
one or zero).

If the special character @samp{&} appears in @var{replacement}, it
stands for the precise substring that was matched by @var{regexp}.  (If
the regexp can match more than one string, then this precise substring
may vary.)  For example:

@example
@{ sub(/candidate/, "& and his wife"); print @}
@end example

@noindent
changes the first occurrence of @samp{candidate} to @samp{candidate
and his wife} on each input line.
Here is another example:

@example
$ awk 'BEGIN @{
>         str = "daabaaa"
>         sub(/a+/, "C&C", str)
>         print str
> @}'
@print{} dCaaCbaaa
@end example

@noindent
This shows how @samp{&} can represent a nonconstant string and also
illustrates the ``leftmost, longest'' rule in regexp matching
(@pxref{Leftmost Longest}).

The effect of this special character (@samp{&}) can be turned off by putting a
backslash before it in the string.  As usual, to insert one backslash in
the string, you must write two backslashes.  Therefore, write @samp{\\&}
in a string constant to include a literal @samp{&} in the replacement.
For example, the following shows how to replace the first @samp{|} on each line with
an @samp{&}:

@example
@{ sub(/\|/, "\\&"); print @}
@end example

@cindex @code{sub} function, arguments of
@cindex @code{gsub} function, arguments of
As mentioned, the third argument to @code{sub} must
be a variable, field or array reference.
Some versions of @command{awk} allow the third argument to
be an expression that is not an lvalue.  In such a case, @code{sub}
still searches for the pattern and returns zero or one, but the result of
the substitution (if any) is thrown away because there is no place
to put it.  Such versions of @command{awk} accept expressions
such as the following:

@example
sub(/USA/, "United States", "the USA and Canada")
@end example

@noindent
@cindex troubleshooting, @code{gsub}/@code{sub} functions
For historical compatibility, @command{gawk} accepts erroneous code,
such as in the previous example.  However, using any other nonchangeable
object as the third parameter causes a fatal error and your program
will not run.

Finally, if the @var{regexp} is not a regexp constant, it is converted into a
string, and then the value of that string is treated as the regexp to match.

@item gsub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
@cindex @code{gsub} function
This is similar to the @code{sub} function, except @code{gsub} replaces
@emph{all} of the longest, leftmost, @emph{nonoverlapping} matching
substrings it can find.  The @samp{g} in @code{gsub} stands for
``global,'' which means replace everywhere.  For example:

@example
@{ gsub(/Britain/, "United Kingdom"); print @}
@end example

@noindent
replaces all occurrences of the string @samp{Britain} with @samp{United
Kingdom} for all input records.
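
To see what ``nonoverlapping'' means in practice, consider this small
sketch (the sample string and replacement are arbitrary choices):

@example
$ awk 'BEGIN @{ str = "aaaa"; gsub(/aa/, "b", str); print str @}'
@print{} bb
@end example

@noindent
The regexp matches characters one and two, and then characters three
and four; the overlapping candidate at characters two and three is not
considered.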

The @code{gsub} function returns the number of substitutions made.  If
the variable to search and alter (@var{target}) is
omitted, then the entire input record (@code{$0}) is used.
As in @code{sub}, the characters @samp{&} and @samp{\} are special,
and the third argument must be assignable.

@item gensub(@var{regexp}, @var{replacement}, @var{how} @r{[}, @var{target}@r{]}) #
@cindex @code{gensub} function (@command{gawk})
@code{gensub} is a general substitution function.  Like @code{sub} and
@code{gsub}, it searches the target string @var{target} for matches of
the regular expression @var{regexp}.  Unlike @code{sub} and @code{gsub},
the modified string is returned as the result of the function and the
original target string is @emph{not} changed.  If @var{how} is a string
beginning with @samp{g} or @samp{G}, then it replaces all matches of
@var{regexp} with @var{replacement}.  Otherwise, @var{how} is treated
as a number that indicates which match of @var{regexp} to replace.  If
no @var{target} is supplied, @code{$0} is used.

@code{gensub} provides an additional feature that is not available
in @code{sub} or @code{gsub}: the ability to specify components of a
regexp in the replacement text.  This is done by using parentheses in
the regexp to mark the components and then specifying @samp{\@var{N}}
in the replacement text, where @var{N} is a digit from 1 to 9.
For example:

@example
$ gawk '
> BEGIN @{
>      a = "abc def"
>      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
>      print b
> @}'
@print{} def abc
@end example

@noindent
As with @code{sub}, you must type two backslashes in order
to get one into the string.
In the replacement text, the sequence @samp{\0} represents the entire
matched text, as does the character @samp{&}.
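
As a short sketch of this equivalence (the sample string and the
bracket characters are arbitrary choices), both calls below wrap every
run of @samp{b} in square brackets:

@example
$ gawk 'BEGIN @{
>     s = "abbc abc"
>     print gensub(/b+/, "[\\0]", "g", s)
>     print gensub(/b+/, "[&]", "g", s)
> @}'
@print{} a[bb]c a[b]c
@print{} a[bb]c a[b]c
@end example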

The following example shows how you can use the third argument to control
which match of the regexp should be changed:

@example
$ echo a b c a b c |
> gawk '@{ print gensub(/a/, "AA", 2) @}'
@print{} a b c AA b c
@end example

In this case, @code{$0} is used as the default target string.
@code{gensub} returns the new string as its result, which is
passed directly to @code{print} for printing.

@c @cindex automatic warnings
@c @cindex warnings, automatic
If the @var{how} argument is a string that does not begin with @samp{g} or
@samp{G}, or if it is a number that is less than or equal to zero, only one
substitution is performed.  If @var{how} is zero, @command{gawk} issues
a warning message.

If @var{regexp} does not match @var{target}, @code{gensub}'s return value
is the original unchanged value of @var{target}.

@code{gensub} is a @command{gawk} extension; it is not available
in compatibility mode (@pxref{Options}).

@item substr(@var{string}, @var{start} @r{[}, @var{length}@r{]})
@cindex @code{substr} function
This returns a @var{length}-character-long substring of @var{string},
starting at character number @var{start}.  The first character of a
string is character number one.@footnote{This is different from
C and C++, in which the first character is number zero.}
For example, @code{substr("washington", 5, 3)} returns @code{"ing"}.

If @var{length} is not present, this function returns the whole suffix of
@var{string} that begins at character number @var{start}.  For example,
@code{substr("washington", 5)} returns @code{"ington"}.  The whole
suffix is also returned
if @var{length} is greater than the number of characters remaining
in the string, counting from character @var{start}.
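
For instance, in the following sketch the requested length of 100 (an
arbitrary, deliberately oversized value) exceeds the three characters
that remain, so the whole suffix is returned:

@example
$ awk 'BEGIN @{ print substr("washington", 8, 100) @}'
@print{} ton
@end example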

If @var{start} is less than one, @code{substr} treats it as
if it was one.  (POSIX doesn't specify what to do in this case:
Unix @command{awk} acts this way, and therefore @command{gawk}
does too.)
If @var{start} is greater than the number of characters
in the string, @code{substr} returns the null string.
Similarly, if @var{length} is present but less than or equal to zero,
the null string is returned.

@cindex troubleshooting, @code{substr} function
The string returned by @code{substr} @emph{cannot} be
assigned.  Thus, it is a mistake to attempt to change a portion of
a string, as shown in the following example:

@example
string = "abcdef"
# try to get "abCDEf", won't work
substr(string, 3, 3) = "CDE"
@end example

@noindent
It is also a mistake to use @code{substr} as the third argument
of @code{sub} or @code{gsub}:

@example
gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG
@end example

@cindex portability, @code{substr} function
(Some commercial versions of @command{awk} do in fact let you use
@code{substr} this way, but doing so is not portable.)

If you need to replace bits and pieces of a string, combine @code{substr}
with string concatenation, in the following manner:

@example
string = "abcdef"
@dots{}
string = substr(string, 1, 2) "CDE" substr(string, 6)
@end example

@cindex case sensitivity, converting case
@cindex converting, case
@item tolower(@var{string})
@cindex @code{tolower} function
This returns a copy of @var{string}, with each uppercase character
in the string replaced with its corresponding lowercase character.
Nonalphabetic characters are left unchanged.  For example,
@code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}.

@item toupper(@var{string})
@cindex @code{toupper} function
This returns a copy of @var{string}, with each lowercase character
in the string replaced with its corresponding uppercase character.
Nonalphabetic characters are left unchanged.  For example,
@code{toupper("MiXeD cAsE 123")} returns @code{"MIXED CASE 123"}.
@end table

@node Gory Details
@subsubsection More About @samp{\} and @samp{&} with @code{sub}, @code{gsub}, and @code{gensub}

@cindex escape processing, @code{gsub}/@code{gensub}/@code{sub} functions
@cindex @code{sub} function, escape processing
@cindex @code{gsub} function, escape processing
@cindex @code{gensub} function (@command{gawk}), escape processing
@cindex @code{\} (backslash), @code{gsub}/@code{gensub}/@code{sub} functions and
@cindex backslash (@code{\}), @code{gsub}/@code{gensub}/@code{sub} functions and
@cindex @code{&} (ampersand), @code{gsub}/@code{gensub}/@code{sub} functions and
@cindex ampersand (@code{&}), @code{gsub}/@code{gensub}/@code{sub} functions and
When using @code{sub}, @code{gsub}, or @code{gensub}, and trying to get literal
backslashes and ampersands into the replacement text, you need to remember
that there are several levels of @dfn{escape processing} going on.

First, there is the @dfn{lexical} level, which is when @command{awk} reads
your program
and builds an internal copy of it that can be executed.
Then there is the runtime level, which is when @command{awk} actually scans the
replacement string to determine what to generate.

At both levels, @command{awk} looks for a defined set of characters that
can come after a backslash.  At the lexical level, it looks for the
escape sequences listed in @ref{Escape Sequences}.
Thus, for every @samp{\} that @command{awk} processes at the runtime
level, type two backslashes at the lexical level.
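
The lexical-level halving can be observed on its own, without any
substitution function involved.  In this minimal sketch (the string is
an arbitrary choice), the two backslashes typed in the program become a
single backslash in the stored string:

@example
$ awk 'BEGIN @{ s = "a\\b"; print length(s), s @}'
@print{} 3 a\b
@end example
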
When a character that is not valid for an escape sequence follows the
@samp{\}, Unix @command{awk} and @command{gawk} both simply remove the initial
@samp{\} and put the next character into the string.  Thus, for
example, @code{"a\qb"} is treated as @code{"aqb"}.

At the runtime level, the various functions handle sequences of
@samp{\} and @samp{&} differently.  The situation is (sadly) somewhat complex.
Historically, the @code{sub} and @code{gsub} functions treated the
two-character sequence @samp{\&} specially; this sequence was replaced in
the generated text with a single @samp{&}.  Any other @samp{\} within
the @var{replacement} string that did not precede an @samp{&} was passed
through unchanged.  To illustrate with a table:

@c Thanks to Karl Berry for help with the TeX stuff.
@tex
\vbox{\bigskip
% This table has lots of &'s and \'s, so unspecialize them.
\catcode`\& = \other \catcode`\\ = \other
% But then we need character for escape and tab.
@catcode`! = 4
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
     You type!@code{sub} sees!@code{sub} generates@cr
@hrulefill!@hrulefill!@hrulefill@cr
     @code{\&}!           @code{&}!the matched text@cr
    @code{\\&}!          @code{\&}!a literal @samp{&}@cr
   @code{\\\&}!          @code{\&}!a literal @samp{&}@cr
  @code{\\\\&}!         @code{\\&}!a literal @samp{\&}@cr
 @code{\\\\\&}!         @code{\\&}!a literal @samp{\&}@cr
@code{\\\\\\&}!        @code{\\\&}!a literal @samp{\\&}@cr
    @code{\\q}!
@code{\q}!a literal @samp{\q}@cr 12982} 12983@bigskip} 12984@end tex 12985@ifnottex 12986@display 12987 You type @code{sub} sees @code{sub} generates 12988 -------- ---------- --------------- 12989 @code{\&} @code{&} the matched text 12990 @code{\\&} @code{\&} a literal @samp{&} 12991 @code{\\\&} @code{\&} a literal @samp{&} 12992 @code{\\\\&} @code{\\&} a literal @samp{\&} 12993 @code{\\\\\&} @code{\\&} a literal @samp{\&} 12994@code{\\\\\\&} @code{\\\&} a literal @samp{\\&} 12995 @code{\\q} @code{\q} a literal @samp{\q} 12996@end display 12997@end ifnottex 12998 12999@noindent 13000This table shows both the lexical-level processing, where 13001an odd number of backslashes becomes an even number at the runtime level, 13002as well as the runtime processing done by @code{sub}. 13003(For the sake of simplicity, the rest of the following tables only show the 13004case of even numbers of backslashes entered at the lexical level.) 13005 13006The problem with the historical approach is that there is no way to get 13007a literal @samp{\} followed by the matched text. 13008 13009@c @cindex @command{awk} language, POSIX version 13010@cindex POSIX @command{awk}, functions and, @code{gsub}/@code{sub} 13011The 1992 POSIX standard attempted to fix this problem. The standard 13012says that @code{sub} and @code{gsub} look for either a @samp{\} or an @samp{&} 13013after the @samp{\}. If either one follows a @samp{\}, that character is 13014output literally. The interpretation of @samp{\} and @samp{&} then becomes: 13015 13016@c thanks to Karl Berry for formatting this table 13017@tex 13018\vbox{\bigskip 13019% This table has lots of &'s and \'s, so unspecialize them. 13020\catcode`\& = \other \catcode`\\ = \other 13021% But then we need character for escape and tab. 13022@catcode`! = 4 13023@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr 13024 You type!@code{sub} sees!@code{sub} generates@cr 13025@hrulefill!@hrulefill!@hrulefill@cr 13026 @code{&}! 
@code{&}!the matched text@cr 13027 @code{\\&}! @code{\&}!a literal @samp{&}@cr 13028@code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr 13029@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr 13030} 13031@bigskip} 13032@end tex 13033@ifnottex 13034@display 13035 You type @code{sub} sees @code{sub} generates 13036 -------- ---------- --------------- 13037 @code{&} @code{&} the matched text 13038 @code{\\&} @code{\&} a literal @samp{&} 13039 @code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text 13040@code{\\\\\\&} @code{\\\&} a literal @samp{\&} 13041@end display 13042@end ifnottex 13043 13044@noindent 13045This appears to solve the problem. 13046Unfortunately, the phrasing of the standard is unusual. It 13047says, in effect, that @samp{\} turns off the special meaning of any 13048following character, but for anything other than @samp{\} and @samp{&}, 13049such special meaning is undefined. This wording leads to two problems: 13050 13051@itemize @bullet 13052@item 13053Backslashes must now be doubled in the @var{replacement} string, breaking 13054historical @command{awk} programs. 13055 13056@item 13057To make sure that an @command{awk} program is portable, @emph{every} character 13058in the @var{replacement} string must be preceded with a 13059backslash.@footnote{This consequence was certainly unintended.} 13060@c I can say that, 'cause I was involved in making this change 13061@end itemize 13062 13063The POSIX standard is under revision. 13064Because of the problems just listed, proposed text for the revised standard 13065reverts to rules that correspond more closely to the original existing 13066practice. The proposed rules have special cases that make it possible 13067to produce a @samp{\} preceding the matched text: 13068 13069@tex 13070\vbox{\bigskip 13071% This table has lots of &'s and \'s, so unspecialize them. 13072\catcode`\& = \other \catcode`\\ = \other 13073% But then we need character for escape and tab. 13074@catcode`! 
= 4
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
    You type!@code{sub} sees!@code{sub} generates@cr
@hrulefill!@hrulefill!@hrulefill@cr
@code{\\\\\\&}!     @code{\\\&}!a literal @samp{\&}@cr
  @code{\\\\&}!     @code{\\&}!a literal @samp{\}, followed by the matched text@cr
    @code{\\&}!      @code{\&}!a literal @samp{&}@cr
    @code{\\q}!      @code{\q}!a literal @samp{\q}@cr
}
@bigskip}
@end tex
@ifnottex
@display
 You type         @code{sub} sees          @code{sub} generates
 --------         ----------          ---------------
@code{\\\\\\&}           @code{\\\&}            a literal @samp{\&}
  @code{\\\\&}           @code{\\&}            a literal @samp{\}, followed by the matched text
    @code{\\&}            @code{\&}            a literal @samp{&}
    @code{\\q}            @code{\q}            a literal @samp{\q}
@end display
@end ifnottex

In a nutshell, at the runtime level, there are now three special sequences
of characters (@samp{\\\&}, @samp{\\&} and @samp{\&}) whereas historically
there was only one.  However, as in the historical case, any @samp{\} that
is not part of one of these three sequences is not special and appears
in the output literally.

@command{gawk} 3.0 and 3.1 follow these proposed POSIX rules for @code{sub} and
@code{gsub}.
@c As much as we think it's a lousy idea. You win some, you lose some. Sigh.
Whether these proposed rules will actually become codified into the
standard is unknown at this point. Subsequent @command{gawk} releases will
track the standard and implement whatever the final version specifies;
this @value{DOCUMENT} will be updated as
well.@footnote{As this @value{DOCUMENT} was being finalized,
we learned that the POSIX standard will not use these rules.
However, it was too late to change @command{gawk} for the 3.1 release.
@command{gawk} behaves as described here.}

The rules for @code{gensub} are considerably simpler.
At the runtime 13115level, whenever @command{gawk} sees a @samp{\}, if the following character 13116is a digit, then the text that matched the corresponding parenthesized 13117subexpression is placed in the generated output. Otherwise, 13118no matter what character follows the @samp{\}, it 13119appears in the generated text and the @samp{\} does not: 13120 13121@tex 13122\vbox{\bigskip 13123% This table has lots of &'s and \'s, so unspecialize them. 13124\catcode`\& = \other \catcode`\\ = \other 13125% But then we need character for escape and tab. 13126@catcode`! = 4 13127@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr 13128 You type!@code{gensub} sees!@code{gensub} generates@cr 13129@hrulefill!@hrulefill!@hrulefill@cr 13130 @code{&}! @code{&}!the matched text@cr 13131 @code{\\&}! @code{\&}!a literal @samp{&}@cr 13132 @code{\\\\}! @code{\\}!a literal @samp{\}@cr 13133 @code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr 13134@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr 13135 @code{\\q}! @code{\q}!a literal @samp{q}@cr 13136} 13137@bigskip} 13138@end tex 13139@ifnottex 13140@display 13141 You type @code{gensub} sees @code{gensub} generates 13142 -------- ------------- ------------------ 13143 @code{&} @code{&} the matched text 13144 @code{\\&} @code{\&} a literal @samp{&} 13145 @code{\\\\} @code{\\} a literal @samp{\} 13146 @code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text 13147@code{\\\\\\&} @code{\\\&} a literal @samp{\&} 13148 @code{\\q} @code{\q} a literal @samp{q} 13149@end display 13150@end ifnottex 13151 13152Because of the complexity of the lexical and runtime level processing 13153and the special cases for @code{sub} and @code{gsub}, 13154we recommend the use of @command{gawk} and @code{gensub} when you have 13155to do substitutions. 
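
As a brief illustration of the @code{gensub} rules in the preceding table,
consider the following sketch.  It assumes @command{gawk} invoked from a
POSIX shell, with the program in single quotes so that the shell leaves
the backslashes alone; the input string @samp{a.b} is arbitrary:

@example
$ echo a.b | gawk '@{ print gensub(/\./, "[&]", "g") @}'
@print{} a[.]b
$ echo a.b | gawk '@{ print gensub(/\./, "\\&", "g") @}'
@print{} a&b
$ echo a.b | gawk '@{ print gensub(/\./, "\\\\&", "g") @}'
@print{} a\.b
@end example

@noindent
In each case, the string constant is first reduced by lexical processing
to what the table calls ``@code{gensub} sees,'' and only then does
@code{gensub} apply its runtime rules to each match.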
13156 13157@c fakenode --- for prepinfo 13158@subheading Advanced Notes: Matching the Null String 13159@c last comma does NOT start tertiary 13160@cindex advanced features, null strings, matching 13161@cindex matching, null strings 13162@cindex null strings, matching 13163@c last comma in next two is part of tertiary 13164@cindex @code{*} (asterisk), @code{*} operator, null strings, matching 13165@cindex asterisk (@code{*}), @code{*} operator, null strings, matching 13166 13167In @command{awk}, the @samp{*} operator can match the null string. 13168This is particularly important for the @code{sub}, @code{gsub}, 13169and @code{gensub} functions. For example: 13170 13171@example 13172$ echo abc | awk '@{ gsub(/m*/, "X"); print @}' 13173@print{} XaXbXcX 13174@end example 13175 13176@noindent 13177Although this makes a certain amount of sense, it can be surprising. 13178 13179@node I/O Functions 13180@subsection Input/Output Functions 13181 13182The following functions relate to input/output (I/O). 13183Optional parameters are enclosed in square brackets ([ ]): 13184 13185@table @code 13186@item close(@var{filename} @r{[}, @var{how}@r{]}) 13187@cindex @code{close} function 13188@cindex files, closing 13189Close the file @var{filename} for input or output. Alternatively, the 13190argument may be a shell command that was used for creating a coprocess, or 13191for redirecting to or from a pipe; then the coprocess or pipe is closed. 13192@xref{Close Files And Pipes}, 13193for more information. 13194 13195When closing a coprocess, it is occasionally useful to first close 13196one end of the two-way pipe and then to close the other. This is done 13197by providing a second argument to @code{close}. This second argument 13198should be one of the two string values @code{"to"} or @code{"from"}, 13199indicating which end of the pipe to close. Case in the string does 13200not matter. 13201@xref{Two-way I/O}, 13202which discusses this feature in more detail and gives an example. 
13203 13204@item fflush(@r{[}@var{filename}@r{]}) 13205@cindex @code{fflush} function 13206Flush any buffered output associated with @var{filename}, which is either a 13207file opened for writing or a shell command for redirecting output to 13208a pipe or coprocess. 13209 13210@cindex portability, @code{fflush} function and 13211@cindex buffers, flushing 13212@cindex output, buffering 13213Many utility programs @dfn{buffer} their output; i.e., they save information 13214to write to a disk file or terminal in memory until there is enough 13215for it to be worthwhile to send the data to the output device. 13216This is often more efficient than writing 13217every little bit of information as soon as it is ready. However, sometimes 13218it is necessary to force a program to @dfn{flush} its buffers; that is, 13219write the information to its destination, even if a buffer is not full. 13220This is the purpose of the @code{fflush} function---@command{gawk} also 13221buffers its output and the @code{fflush} function forces 13222@command{gawk} to flush its buffers. 13223 13224@code{fflush} was added to the Bell Laboratories research 13225version of @command{awk} in 1994; it is not part of the POSIX standard and is 13226not available if @option{--posix} has been specified on the 13227command line (@pxref{Options}). 13228 13229@cindex @command{gawk}, @code{fflush} function in 13230@command{gawk} extends the @code{fflush} function in two ways. The first 13231is to allow no argument at all. In this case, the buffer for the 13232standard output is flushed. The second is to allow the null string 13233(@w{@code{""}}) as the argument. In this case, the buffers for 13234@emph{all} open output files and pipes are flushed. 13235 13236@c @cindex automatic warnings 13237@c @cindex warnings, automatic 13238@cindex troubleshooting, @code{fflush} function 13239@code{fflush} returns zero if the buffer is successfully flushed; 13240otherwise, it returns @minus{}1. 
In the case where all buffers are flushed, the return value is zero
only if all buffers were flushed successfully.  Otherwise, it is
@minus{}1, and @command{gawk} warns about the @var{filename} that
caused the problem.

@command{gawk} also issues a warning message if you attempt to flush
a file or pipe that was opened for reading (such as with @code{getline}),
or if @var{filename} is not an open file, pipe, or coprocess.
In such a case, @code{fflush} also returns @minus{}1.

@item system(@var{command})
@cindex @code{system} function
@cindex interacting with other programs
Execute the operating-system command given by the string @var{command},
and then return to the @command{awk} program.  The return value is
the exit status of the command that was executed.

For example, if the following fragment of code is put in your @command{awk}
program:

@example
END @{
     system("date | mail -s 'awk run done' root")
@}
@end example

@noindent
the system administrator is sent mail when the @command{awk} program
finishes processing input and begins its end-of-input processing.

Note that redirecting @code{print} or @code{printf} into a pipe is often
enough to accomplish your task.  If you need to run many commands, it
is more efficient to simply print them down a pipeline to the shell:

@example
while (@var{more stuff to do})
    print @var{command} | "/bin/sh"
close("/bin/sh")
@end example

@noindent
@cindex troubleshooting, @code{system} function
However, if your @command{awk}
program is interactive, @code{system} is useful for starting large
self-contained programs, such as a shell or an editor.
Some operating systems cannot implement the @code{system} function.
13288@code{system} causes a fatal error if it is not supported. 13289@end table 13290 13291@c fakenode --- for prepinfo 13292@subheading Advanced Notes: Interactive Versus Noninteractive Buffering 13293@cindex advanced features, buffering 13294@cindex buffering, interactive vs. noninteractive 13295 13296As a side point, buffering issues can be even more confusing, depending 13297upon whether your program is @dfn{interactive}, i.e., communicating 13298with a user sitting at a keyboard.@footnote{A program is interactive 13299if the standard output is connected 13300to a terminal device.} 13301 13302@c Thanks to Walter.Mecky@dresdnerbank.de for this example, and for 13303@c motivating me to write this section. 13304Interactive programs generally @dfn{line buffer} their output; i.e., they 13305write out every line. Noninteractive programs wait until they have 13306a full buffer, which may be many lines of output. 13307Here is an example of the difference: 13308 13309@example 13310$ awk '@{ print $1 + $2 @}' 133111 1 13312@print{} 2 133132 3 13314@print{} 5 13315@kbd{@value{CTL}-d} 13316@end example 13317 13318@noindent 13319Each line of output is printed immediately. Compare that behavior 13320with this example: 13321 13322@example 13323$ awk '@{ print $1 + $2 @}' | cat 133241 1 133252 3 13326@kbd{@value{CTL}-d} 13327@print{} 2 13328@print{} 5 13329@end example 13330 13331@noindent 13332Here, no output is printed until after the @kbd{@value{CTL}-d} is typed, because 13333it is all buffered and sent down the pipe to @command{cat} in one shot. 13334 13335@c fakenode --- for prepinfo 13336@subheading Advanced Notes: Controlling Output Buffering with @code{system} 13337@cindex advanced features, buffering 13338@cindex buffers, flushing 13339@cindex buffering, input/output 13340@cindex output, buffering 13341 13342The @code{fflush} function provides explicit control over output buffering for 13343individual files and pipes. 
However, its use is not portable to many other 13344@command{awk} implementations. An alternative method to flush output 13345buffers is to call @code{system} with a null string as its argument: 13346 13347@example 13348system("") # flush output 13349@end example 13350 13351@noindent 13352@command{gawk} treats this use of the @code{system} function as a special 13353case and is smart enough not to run a shell (or other command 13354interpreter) with the empty command. Therefore, with @command{gawk}, this 13355idiom is not only useful, it is also efficient. While this method should work 13356with other @command{awk} implementations, it does not necessarily avoid 13357starting an unnecessary shell. (Other implementations may only 13358flush the buffer associated with the standard output and not necessarily 13359all buffered output.) 13360 13361If you think about what a programmer expects, it makes sense that 13362@code{system} should flush any pending output. The following program: 13363 13364@example 13365BEGIN @{ 13366 print "first print" 13367 system("echo system echo") 13368 print "second print" 13369@} 13370@end example 13371 13372@noindent 13373must print: 13374 13375@example 13376first print 13377system echo 13378second print 13379@end example 13380 13381@noindent 13382and not: 13383 13384@example 13385system echo 13386first print 13387second print 13388@end example 13389 13390If @command{awk} did not flush its buffers before calling @code{system}, 13391you would see the latter (undesirable) output. 
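
To tie these notes together, here is a small sketch of forcing a flush
after every record.  It assumes @command{gawk}, since calling
@code{fflush} with no argument is a @command{gawk} extension; with other
versions of @command{awk}, @samp{system("")} may serve instead, subject
to the caveats just described:

@example
$ gawk '@{ print $1 + $2; fflush() @}' | cat
1 1
@print{} 2
2 3
@print{} 5
@kbd{@value{CTL}-d}
@end example

@noindent
Even though the standard output is a pipe, each result should now appear
as soon as its input line has been read, just as in the interactive case.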

@node Time Functions
@subsection Using @command{gawk}'s Timestamp Functions

@c STARTOFRANGE tst
@cindex timestamps
@c STARTOFRANGE logftst
@cindex log files, timestamps in
@c last comma does NOT start tertiary
@c STARTOFRANGE filogtst
@cindex files, log, timestamps in
@c STARTOFRANGE gawtst
@cindex @command{gawk}, timestamps
@cindex POSIX @command{awk}, timestamps and
@command{awk} programs are commonly used to process log files
containing timestamp information, indicating when a
particular log record was written.  Many programs log their timestamps
in the form returned by the @code{time} system call, which is the
number of seconds since a particular epoch.  On POSIX-compliant systems,
it is the number of seconds since
1970-01-01 00:00:00 UTC, not counting leap seconds.@footnote{@xref{Glossary},
especially the entries ``Epoch'' and ``UTC.''}
All known POSIX-compliant systems support timestamps from 0 through
@math{2^31 - 1}, which is sufficient to represent times through
2038-01-19 03:14:07 UTC.  Many systems support a wider range of timestamps,
including negative timestamps that represent times before the
epoch.

@cindex @command{date} utility, GNU
@cindex time, retrieving
In order to make it easier to process such log files and to produce
useful reports, @command{gawk} provides the following functions for
working with timestamps.  They are @command{gawk} extensions; they are
not specified in the POSIX standard, nor are they in any other known
version of @command{awk}.@footnote{The GNU @command{date} utility can
also do many of the things described here.
Its use may be preferable 13428for simple time-related operations in shell scripts.} 13429Optional parameters are enclosed in square brackets ([ ]): 13430 13431@table @code 13432@item systime() 13433@cindex @code{systime} function (@command{gawk}) 13434@cindex timestamps 13435This function returns the current time as the number of seconds since 13436the system epoch. On POSIX systems, this is the number of seconds 13437since 1970-01-01 00:00:00 UTC, not counting leap seconds. 13438It may be a different number on 13439other systems. 13440 13441@item mktime(@var{datespec}) 13442@cindex @code{mktime} function (@command{gawk}) 13443This function turns @var{datespec} into a timestamp in the same form 13444as is returned by @code{systime}. It is similar to the function of the 13445same name in ISO C. The argument, @var{datespec}, is a string of the form 13446@w{@code{"@var{YYYY} @var{MM} @var{DD} @var{HH} @var{MM} @var{SS} [@var{DST}]"}}. 13447The string consists of six or seven numbers representing, respectively, 13448the full year including century, the month from 1 to 12, the day of the month 13449from 1 to 31, the hour of the day from 0 to 23, the minute from 0 to 1345059, the second from 0 to 60,@footnote{Occasionally there are 13451minutes in a year with a leap second, which is why the 13452seconds can go up to 60.} 13453and an optional daylight-savings flag. 13454 13455The values of these numbers need not be within the ranges specified; 13456for example, an hour of @minus{}1 means 1 hour before midnight. 13457The origin-zero Gregorian calendar is assumed, with year 0 preceding 13458year 1 and year @minus{}1 preceding year 0. 13459The time is assumed to be in the local timezone. 
13460If the daylight-savings flag is positive, the time is assumed to be 13461daylight savings time; if zero, the time is assumed to be standard 13462time; and if negative (the default), @code{mktime} attempts to determine 13463whether daylight savings time is in effect for the specified time. 13464 13465If @var{datespec} does not contain enough elements or if the resulting time 13466is out of range, @code{mktime} returns @minus{}1. 13467 13468@item strftime(@r{[}@var{format} @r{[}, @var{timestamp}@r{]]}) 13469@c STARTOFRANGE strf 13470@cindex @code{strftime} function (@command{gawk}) 13471This function returns a string. It is similar to the function of the 13472same name in ISO C. The time specified by @var{timestamp} is used to 13473produce a string, based on the contents of the @var{format} string. 13474The @var{timestamp} is in the same format as the value returned by the 13475@code{systime} function. If no @var{timestamp} argument is supplied, 13476@command{gawk} uses the current time of day as the timestamp. 13477If no @var{format} argument is supplied, @code{strftime} uses 13478@code{@w{"%a %b %d %H:%M:%S %Z %Y"}}. This format string produces 13479output that is (almost) equivalent to that of the @command{date} utility. 13480(Versions of @command{gawk} prior to 3.0 require the @var{format} argument.) 13481@end table 13482 13483The @code{systime} function allows you to compare a timestamp from a 13484log file with the current time of day. In particular, it is easy to 13485determine how long ago a particular record was logged. It also allows 13486you to produce log records using the ``seconds since the epoch'' format. 13487 13488@cindex converting, dates to timestamps 13489@cindex dates, converting to timestamps 13490@cindex timestamps, converting dates to 13491The @code{mktime} function allows you to convert a textual representation 13492of a date and time into a timestamp. 
This makes it easy to do before/after 13493comparisons of dates and times, particularly when dealing with date and 13494time data coming from an external source, such as a log file. 13495 13496The @code{strftime} function allows you to easily turn a timestamp 13497into human-readable information. It is similar in nature to the @code{sprintf} 13498function 13499(@pxref{String Functions}), 13500in that it copies nonformat specification characters verbatim to the 13501returned string, while substituting date and time values for format 13502specifications in the @var{format} string. 13503 13504@cindex format specifiers, @code{strftime} function (@command{gawk}) 13505@code{strftime} is guaranteed by the 1999 ISO C standard@footnote{As this 13506is a recent standard, not every system's @code{strftime} necessarily 13507supports all of the conversions listed here.} 13508to support the following date format specifications: 13509 13510@table @code 13511@item %a 13512The locale's abbreviated weekday name. 13513 13514@item %A 13515The locale's full weekday name. 13516 13517@item %b 13518The locale's abbreviated month name. 13519 13520@item %B 13521The locale's full month name. 13522 13523@item %c 13524The locale's ``appropriate'' date and time representation. 13525(This is @samp{%A %B %d %T %Y} in the @code{"C"} locale.) 13526 13527@item %C 13528The century. This is the year divided by 100 and truncated to the next 13529lower integer. 13530 13531@item %d 13532The day of the month as a decimal number (01--31). 13533 13534@item %D 13535Equivalent to specifying @samp{%m/%d/%y}. 13536 13537@item %e 13538The day of the month, padded with a space if it is only one digit. 13539 13540@item %F 13541Equivalent to specifying @samp{%Y-%m-%d}. 13542This is the ISO 8601 date format. 13543 13544@item %g 13545The year modulo 100 of the ISO week number, as a decimal number (00--99). 13546For example, January 1, 1993 is in week 53 of 1992. 
Thus, the year 13547of its ISO week number is 1992, even though its year is 1993. 13548Similarly, December 31, 1973 is in week 1 of 1974. Thus, the year 13549of its ISO week number is 1974, even though its year is 1973. 13550 13551@item %G 13552The full year of the ISO week number, as a decimal number. 13553 13554@item %h 13555Equivalent to @samp{%b}. 13556 13557@item %H 13558The hour (24-hour clock) as a decimal number (00--23). 13559 13560@item %I 13561The hour (12-hour clock) as a decimal number (01--12). 13562 13563@item %j 13564The day of the year as a decimal number (001--366). 13565 13566@item %m 13567The month as a decimal number (01--12). 13568 13569@item %M 13570The minute as a decimal number (00--59). 13571 13572@item %n 13573A newline character (ASCII LF). 13574 13575@item %p 13576The locale's equivalent of the AM/PM designations associated 13577with a 12-hour clock. 13578 13579@item %r 13580The locale's 12-hour clock time. 13581(This is @samp{%I:%M:%S %p} in the @code{"C"} locale.) 13582 13583@item %R 13584Equivalent to specifying @samp{%H:%M}. 13585 13586@item %S 13587The second as a decimal number (00--60). 13588 13589@item %t 13590A TAB character. 13591 13592@item %T 13593Equivalent to specifying @samp{%H:%M:%S}. 13594 13595@item %u 13596The weekday as a decimal number (1--7). Monday is day one. 13597 13598@item %U 13599The week number of the year (the first Sunday as the first day of week one) 13600as a decimal number (00--53). 13601 13602@c @cindex ISO 8601 13603@item %V 13604The week number of the year (the first Monday as the first 13605day of week one) as a decimal number (01--53). 13606The method for determining the week number is as specified by ISO 8601. 13607(To wit: if the week containing January 1 has four or more days in the 13608new year, then it is week one; otherwise it is week 53 of the previous year 13609and the next week is week one.) 13610 13611@item %w 13612The weekday as a decimal number (0--6). Sunday is day zero. 
13613 13614@item %W 13615The week number of the year (the first Monday as the first day of week one) 13616as a decimal number (00--53). 13617 13618@item %x 13619The locale's ``appropriate'' date representation. 13620(This is @samp{%A %B %d %Y} in the @code{"C"} locale.) 13621 13622@item %X 13623The locale's ``appropriate'' time representation. 13624(This is @samp{%T} in the @code{"C"} locale.) 13625 13626@item %y 13627The year modulo 100 as a decimal number (00--99). 13628 13629@item %Y 13630The full year as a decimal number (e.g., 1995). 13631 13632@c @cindex RFC 822 13633@c @cindex RFC 1036 13634@item %z 13635The timezone offset in a +HHMM format (e.g., the format necessary to 13636produce RFC 822/RFC 1036 date headers). 13637 13638@item %Z 13639The time zone name or abbreviation; no characters if 13640no time zone is determinable. 13641 13642@item %Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH 13643@itemx %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy 13644``Alternate representations'' for the specifications 13645that use only the second letter (@samp{%c}, @samp{%C}, 13646and so on).@footnote{If you don't understand any of this, don't worry about 13647it; these facilities are meant to make it easier to ``internationalize'' 13648programs. 13649Other internationalization features are described in 13650@ref{Internationalization}.} 13651(These facilitate compliance with the POSIX @command{date} utility.) 13652 13653@item %% 13654A literal @samp{%}. 13655@end table 13656 13657If a conversion specifier is not one of the above, the behavior is 13658undefined.@footnote{This is because ISO C leaves the 13659behavior of the C version of @code{strftime} undefined and @command{gawk} 13660uses the system's version of @code{strftime} if it's there. 
13661Typically, the conversion specifier either does not appear in the 13662returned string or appears literally.} 13663 13664@c @cindex locale, definition of 13665Informally, a @dfn{locale} is the geographic place in which a program 13666is meant to run. For example, a common way to abbreviate the date 13667September 4, 1991 in the United States is ``9/4/91.'' 13668In many countries in Europe, however, it is abbreviated ``4.9.91.'' 13669Thus, the @samp{%x} specification in a @code{"US"} locale might produce 13670@samp{9/4/91}, while in a @code{"EUROPE"} locale, it might produce 13671@samp{4.9.91}. The ISO C standard defines a default @code{"C"} 13672locale, which is an environment that is typical of what most C programmers 13673are used to. 13674 13675A public-domain C version of @code{strftime} is supplied with @command{gawk} 13676for systems that are not yet fully standards-compliant. 13677It supports all of the just listed format specifications. 13678If that version is 13679used to compile @command{gawk} (@pxref{Installation}), 13680then the following additional format specifications are available: 13681 13682@table @code 13683@item %k 13684The hour (24-hour clock) as a decimal number (0--23). 13685Single-digit numbers are padded with a space. 13686 13687@item %l 13688The hour (12-hour clock) as a decimal number (1--12). 13689Single-digit numbers are padded with a space. 13690 13691@item %N 13692The ``Emperor/Era'' name. 13693Equivalent to @code{%C}. 13694 13695@item %o 13696The ``Emperor/Era'' year. 13697Equivalent to @code{%y}. 13698 13699@item %s 13700The time as a decimal timestamp in seconds since the epoch. 13701 13702@item %v 13703The date in VMS format (e.g., @samp{20-JUN-1991}). 13704@end table 13705@c ENDOFRANGE strf 13706 13707Additionally, the alternate representations are recognized but their 13708normal representations are used. 
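
As a short illustration of how the timestamp functions fit together,
the following sketch converts a date to a timestamp with @code{mktime}
and formats it again with @code{strftime}.  The date used is arbitrary,
and the weekday name shown assumes an English-language or @code{"C"}
locale:

@example
$ gawk 'BEGIN @{
>     then = mktime("2003 06 15 12 00 00")
>     print strftime("%Y-%m-%d %H:%M:%S", then)
>     print strftime("%A", then + 24 * 60 * 60)
> @}'
@print{} 2003-06-15 12:00:00
@print{} Monday
@end example

@noindent
Because both functions work in the local timezone, the first line of
output matches the original @var{datespec}.  Adding the number of
seconds in a day to the timestamp moves the date forward by one day.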

@cindex @code{date} utility, POSIX
@cindex POSIX @command{awk}, @code{date} utility and
The following example is an @command{awk} implementation of the POSIX
@command{date} utility.  Normally, the @command{date} utility prints the
current date and time of day in a well-known format.  However, if you
provide an argument to it that begins with a @samp{+}, @command{date}
copies nonformat specifier characters to the standard output and
interprets the current time according to the format specifiers in
the string.  For example:

@example
$ date '+Today is %A, %B %d, %Y.'
@print{} Today is Thursday, September 14, 2000.
@end example

Here is the @command{gawk} version of the @command{date} utility.
It has a shell ``wrapper'' to handle the @option{-u} option,
which requires that @command{date} run as if the time zone
is set to UTC:

@example
#! /bin/sh
#
# date --- approximate the P1003.2 'date' command

case $1 in
-u)  TZ=UTC0     # use UTC
     export TZ
     shift ;;
esac

@c FIXME: One day, change %d to %e, when C 99 is common.
13742gawk 'BEGIN @{ 13743 format = "%a %b %d %H:%M:%S %Z %Y" 13744 exitval = 0 13745 13746 if (ARGC > 2) 13747 exitval = 1 13748 else if (ARGC == 2) @{ 13749 format = ARGV[1] 13750 if (format ~ /^\+/) 13751 format = substr(format, 2) # remove leading + 13752 @} 13753 print strftime(format) 13754 exit exitval 13755@}' "$@@" 13756@end example 13757@c ENDOFRANGE tst 13758@c ENDOFRANGE logftst 13759@c ENDOFRANGE filogtst 13760@c ENDOFRANGE gawtst 13761 13762@node Bitwise Functions 13763@subsection Bit-Manipulation Functions of @command{gawk} 13764@c STARTOFRANGE bit 13765@cindex bitwise, operations 13766@c STARTOFRANGE and 13767@cindex AND bitwise operation 13768@c STARTOFRANGE oro 13769@cindex OR bitwise operation 13770@c STARTOFRANGE xor 13771@cindex XOR bitwise operation 13772@c STARTOFRANGE opbit 13773@cindex operations, bitwise 13774@quotation 13775@i{I can explain it for you, but I can't understand it for you.}@* 13776Anonymous 13777@end quotation 13778 13779Many languages provide the ability to perform @dfn{bitwise} operations 13780on two integer numbers. In other words, the operation is performed on 13781each successive pair of bits in the operands. 13782Three common operations are bitwise AND, OR, and XOR. 13783The operations are described by the following table: 13784 13785@ifnottex 13786@display 13787 Bit Operator 13788 | AND | OR | XOR 13789 |---+---+---+---+---+--- 13790Operands | 0 | 1 | 0 | 1 | 0 | 1 13791----------+---+---+---+---+---+--- 13792 0 | 0 0 | 0 1 | 0 1 13793 1 | 0 1 | 1 1 | 1 0 13794@end display 13795@end ifnottex 13796@tex 13797\centerline{ 13798\vbox{\bigskip % space above the table (about 1 linespace) 13799% Because we have vertical rules, we can't let TeX insert interline space 13800% in its usual way. 
\offinterlineskip
\halign{\strut\hfil#\quad\hfil  % operands
        &\vrule#&\quad#\quad    % rule, 0 (of and)
        &\vrule#&\quad#\quad    % rule, 1 (of and)
        &\vrule#                % rule between and and or
        &\quad#\quad            % 0 (of or)
        &\vrule#&\quad#\quad    % rule, 1 (of or)
        &\vrule#                % rule between or and xor
        &\quad#\quad            % 0 of xor
        &\vrule#&\quad#\quad    % rule, 1 of xor
        \cr
&\omit&\multispan{11}\hfil\bf Bit operator\hfil\cr
\noalign{\smallskip}
&     &\multispan3\hfil AND\hfil&&\multispan3\hfil OR\hfil
                           &&\multispan3\hfil XOR\hfil\cr
\bf Operands&&0&&1&&0&&1&&0&&1\cr
\noalign{\hrule}
\omit&height 2pt&&\omit&&&&\omit&&&&\omit\cr
\noalign{\hrule height0pt}% without this the rule does not extend; why?
0&&0&\omit&0&&0&\omit&1&&0&\omit&1\cr
1&&0&\omit&1&&1&\omit&1&&1&\omit&0\cr
}}}
@end tex

@cindex bitwise, complement
@cindex complement, bitwise
As you can see, the result of an AND operation is 1 only when @emph{both}
bits are 1.
The result of an OR operation is 1 if @emph{either} bit is 1.
The result of an XOR operation is 1 if either bit is 1,
but not both.
The next operation is the @dfn{complement}; the complement of 1 is 0 and
the complement of 0 is 1. Thus, this operation ``flips'' all the bits
of a given value.

@cindex bitwise, shift
@cindex left shift, bitwise
@cindex right shift, bitwise
@cindex shift, bitwise
Finally, two other common operations are to shift the bits left or right.
For example, if you have a bit string @samp{10111001} and you shift it
right by three bits, you end up with @samp{00010111}.@footnote{This example
shows that 0's come in on the left side. For @command{gawk}, this is
always true, but in some languages, it's possible to have the left side
fill with 1's. Caveat emptor.}
@c Purposely decided to use 0's and 1's here.  2/2001.
If you start over
again with @samp{10111001} and shift it left by three bits, you end up
with @samp{11001000}.
@command{gawk} provides built-in functions that implement the
bitwise operations just described. They are:

@ignore
@table @code
@cindex @code{and} function (@command{gawk})
@item and(@var{v1}, @var{v2})
Return the bitwise AND of the values provided by @var{v1} and @var{v2}.

@cindex @code{or} function (@command{gawk})
@item or(@var{v1}, @var{v2})
Return the bitwise OR of the values provided by @var{v1} and @var{v2}.

@cindex @code{xor} function (@command{gawk})
@item xor(@var{v1}, @var{v2})
Return the bitwise XOR of the values provided by @var{v1} and @var{v2}.

@cindex @code{compl} function (@command{gawk})
@item compl(@var{val})
Return the bitwise complement of @var{val}.

@cindex @code{lshift} function (@command{gawk})
@item lshift(@var{val}, @var{count})
Return the value of @var{val}, shifted left by @var{count} bits.

@cindex @code{rshift} function (@command{gawk})
@item rshift(@var{val}, @var{count})
Return the value of @var{val}, shifted right by @var{count} bits.
@end table
@end ignore

@cindex @command{gawk}, bitwise operations in
@multitable {@code{rshift(@var{val}, @var{count})}} {Return the value of @var{val}, shifted right by @var{count} bits.}
@cindex @code{and} function (@command{gawk})
@item @code{and(@var{v1}, @var{v2})}
@tab Returns the bitwise AND of the values provided by @var{v1} and @var{v2}.

@cindex @code{or} function (@command{gawk})
@item @code{or(@var{v1}, @var{v2})}
@tab Returns the bitwise OR of the values provided by @var{v1} and @var{v2}.

@cindex @code{xor} function (@command{gawk})
@item @code{xor(@var{v1}, @var{v2})}
@tab Returns the bitwise XOR of the values provided by @var{v1} and @var{v2}.

@cindex @code{compl} function (@command{gawk})
@item @code{compl(@var{val})}
@tab Returns the bitwise complement of @var{val}.

@cindex @code{lshift} function (@command{gawk})
@item @code{lshift(@var{val}, @var{count})}
@tab Returns the value of @var{val}, shifted left by @var{count} bits.

@cindex @code{rshift} function (@command{gawk})
@item @code{rshift(@var{val}, @var{count})}
@tab Returns the value of @var{val}, shifted right by @var{count} bits.
@end multitable

For all of these functions, first the double-precision floating-point value is
converted to the widest C unsigned integer type, then the bitwise operation is
performed and then the result is converted back into a C @code{double}. (If
you don't understand this paragraph, don't worry about it.)

Here is a user-defined function
(@pxref{User-defined})
that illustrates the use of these functions:

@cindex @code{bits2str} user-defined function
@cindex @code{testbits.awk} program
@smallexample
@group
@c file eg/lib/bits2str.awk
# bits2str --- turn a byte into readable 1's and 0's

function bits2str(bits,        data, mask)
@{
    if (bits == 0)
        return "0"

    mask = 1
    for (; bits != 0; bits = rshift(bits, 1))
        data = (and(bits, mask) ? "1" : "0") data

    while ((length(data) % 8) != 0)
        data = "0" data

    return data
@}
@c endfile
@end group

@c this is a hack to make testbits.awk self-contained
@ignore
@c file eg/prog/testbits.awk
# bits2str --- turn a byte into readable 1's and 0's

function bits2str(bits,        data, mask)
@{
    if (bits == 0)
        return "0"

    mask = 1
    for (; bits != 0; bits = rshift(bits, 1))
        data = (and(bits, mask) ? "1" : "0") data

    while ((length(data) % 8) != 0)
        data = "0" data

    return data
@}
@c endfile
@end ignore
@c file eg/prog/testbits.awk
BEGIN @{
    printf "123 = %s\n", bits2str(123)
    printf "0123 = %s\n", bits2str(0123)
    printf "0x99 = %s\n", bits2str(0x99)
    comp = compl(0x99)
    printf "compl(0x99) = %#x = %s\n", comp, bits2str(comp)
    shift = lshift(0x99, 2)
    printf "lshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift)
    shift = rshift(0x99, 2)
    printf "rshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift)
@}
@c endfile
@end smallexample

@noindent
This program produces the following output when run:

@smallexample
$ gawk -f testbits.awk
@print{} 123 = 01111011
@print{} 0123 = 01010011
@print{} 0x99 = 10011001
@print{} compl(0x99) = 0xffffff66 = 11111111111111111111111101100110
@print{} lshift(0x99, 2) = 0x264 = 0000001001100100
@print{} rshift(0x99, 2) = 0x26 = 00100110
@end smallexample

@cindex numbers, converting, to strings
@cindex strings, converting, numbers to
@cindex converting, numbers, to strings
The @code{bits2str} function turns a binary number into a string.
The number @code{1} represents a binary value where the rightmost bit
is set to 1. Using this mask,
the function repeatedly checks the rightmost bit.
ANDing the mask with the value indicates whether the
rightmost bit is 1 or not. If so, a @code{"1"} is concatenated onto the front
of the string.
Otherwise, a @code{"0"} is added.
The value is then shifted right by one bit and the loop continues
until there are no more 1 bits.

If the initial value is zero it returns a simple @code{"0"}.
Otherwise, at the end, it pads the value with zeros to represent multiples
of 8-bit quantities. This is typical in modern computers.

The main code in the @code{BEGIN} rule shows the difference between the
decimal and octal values for the same numbers
(@pxref{Nondecimal-numbers}),
and then demonstrates the
results of the @code{compl}, @code{lshift}, and @code{rshift} functions.
@c ENDOFRANGE bit
@c ENDOFRANGE and
@c ENDOFRANGE oro
@c ENDOFRANGE xor
@c ENDOFRANGE opbit

@node I18N Functions
@subsection Using @command{gawk}'s String-Translation Functions
@cindex @command{gawk}, string-translation functions
@cindex functions, string-translation
@cindex internationalization
@cindex @command{awk} programs, internationalizing

@command{gawk} provides facilities for internationalizing @command{awk} programs.
These include the functions described in the following list.
The descriptions here are purposely brief.
@xref{Internationalization},
for the full story.
Optional parameters are enclosed in square brackets ([ ]):

@table @code
@cindex @code{dcgettext} function (@command{gawk})
@item dcgettext(@var{string} @r{[}, @var{domain} @r{[}, @var{category}@r{]]})
This function returns the translation of @var{string} in
text domain @var{domain} for locale category @var{category}.
The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
The default value for @var{category} is @code{"LC_MESSAGES"}.

@cindex @code{dcngettext} function (@command{gawk})
@item dcngettext(@var{string1}, @var{string2}, @var{number} @r{[}, @var{domain} @r{[}, @var{category}@r{]]})
This function returns the plural form used for @var{number} of the
translation of @var{string1} and @var{string2} in text domain
@var{domain} for locale category @var{category}. @var{string1} is the
English singular variant of a message, and @var{string2} the English plural
variant of the same message.
The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
The default value for @var{category} is @code{"LC_MESSAGES"}.

@cindex @code{bindtextdomain} function (@command{gawk})
@item bindtextdomain(@var{directory} @r{[}, @var{domain}@r{]})
This function allows you to specify the directory in which
@command{gawk} will look for message translation files, in case they
will not or cannot be placed in the ``standard'' locations
(e.g., during testing).
It returns the directory in which @var{domain} is ``bound.''

The default @var{domain} is the value of @code{TEXTDOMAIN}.
If @var{directory} is the null string (@code{""}), then
@code{bindtextdomain} returns the current binding for the
given @var{domain}.
@end table
@c ENDOFRANGE funcbi
@c ENDOFRANGE bifunc

@node User-defined
@section User-Defined Functions

@c STARTOFRANGE udfunc
@cindex user-defined, functions
@c STARTOFRANGE funcud
@cindex functions, user-defined
Complicated @command{awk} programs can often be simplified by defining
your own functions. User-defined functions can be called just like
built-in ones (@pxref{Function Calls}), but it is up to you to define
them, i.e., to tell @command{awk} what they should do.

@menu
* Definition Syntax::           How to write definitions and what they mean.
* Function Example::            An example function definition and what it
                                does.
* Function Caveats::            Things to watch out for.
* Return Statement::            Specifying the value a function returns.
* Dynamic Typing::              How variable types can change at runtime.
@end menu

@node Definition Syntax
@subsection Function Definition Syntax

@c STARTOFRANGE fdef
@cindex functions, defining
Definitions of functions can appear anywhere between the rules of an
@command{awk} program. Thus, the general form of an @command{awk} program is
extended to include sequences of rules @emph{and} user-defined function
definitions.
There is no need to put the definition of a function
before all uses of the function. This is because @command{awk} reads the
entire program before starting to execute any of it.

The definition of a function named @var{name} looks like this:
@c NEXT ED: put [ ] around parameter list

@example
function @var{name}(@var{parameter-list})
@{
     @var{body-of-function}
@}
@end example

@cindex names, functions
@cindex functions, names of
@cindex namespace issues, functions
@noindent
@var{name} is the name of the function to define. A valid function
name is like a valid variable name: a sequence of letters, digits, and
underscores that doesn't start with a digit.
Within a single @command{awk} program, any particular name can only be
used as a variable, array, or function.

@c NEXT ED: parameter-list is an OPTIONAL list of ...
@var{parameter-list} is a list of the function's arguments and local
variable names, separated by commas. When the function is called,
the argument names are used to hold the argument values given in
the call. The local variables are initialized to the empty string.
A function cannot have two parameters with the same name, nor may it
have a parameter with the same name as the function itself.

The @var{body-of-function} consists of @command{awk} statements. It is the
most important part of the definition, because it says what the function
should actually @emph{do}. The argument names exist to give the body a
way to talk about the arguments; local variables exist to give the body
places to keep temporary values.

Argument names are not distinguished syntactically from local variable
names. Instead, the number of arguments supplied when the function is
called determines how many argument variables there are. Thus, if three
argument values are given, the first three names in @var{parameter-list}
are arguments and the rest are local variables.

It follows that if the number of arguments is not the same in all calls
to the function, some of the names in @var{parameter-list} may be
arguments on some occasions and local variables on others. Another
way to think of this is that omitted arguments default to the
null string.

@cindex programming conventions, functions, writing
Usually when you write a function, you know how many names you intend to
use for arguments and how many you intend to use as local variables. It is
conventional to place some extra space between the arguments and
the local variables, in order to document how your function is supposed to be used.

@cindex variables, shadowing
During execution of the function body, the arguments and local variable
values hide, or @dfn{shadow}, any variables of the same names used in the
rest of the program. The shadowed variables are not accessible in the
function definition, because there is no way to name them while their
names have been taken away for the local variables.
All other variables
used in the @command{awk} program can be referenced or set normally in the
function's body.

The arguments and local variables last only as long as the function body
is executing. Once the body finishes, you can once again access the
variables that were shadowed while the function was running.

@cindex recursive functions
@cindex functions, recursive
The function body can contain expressions that call functions. They
can even call this function, either directly or by way of another
function. When this happens, we say the function is @dfn{recursive}.
The act of a function calling itself is called @dfn{recursion}.

@c @cindex @command{awk} language, POSIX version
@c @cindex POSIX @command{awk}
@cindex POSIX @command{awk}, @code{function} keyword in
In many @command{awk} implementations, including @command{gawk},
the keyword @code{function} may be
abbreviated @code{func}. However, POSIX only specifies the use of
the keyword @code{function}. This actually has some practical implications.
If @command{gawk} is in POSIX-compatibility mode
(@pxref{Options}), then the following
statement does @emph{not} define a function:

@example
func foo() @{ a = sqrt($1) ; print a @}
@end example

@noindent
Instead it defines a rule that, for each record, concatenates the value
of the variable @samp{func} with the return value of the function @samp{foo}.
If the resulting string is non-null, the action is executed.
This is probably not what is desired. (@command{awk} accepts this input as
syntactically valid, because functions may be used before they are defined
in @command{awk} programs.)
@c NEXT ED: This won't actually run, since foo() is undefined ...

@c last comma does NOT start tertiary
@cindex portability, functions, defining
To ensure that your @command{awk} programs are portable, always use the
keyword @code{function} when defining a function.

@node Function Example
@subsection Function Definition Examples

Here is an example of a user-defined function, called @code{myprint}, that
takes a number and prints it in a specific format:

@example
function myprint(num)
@{
     printf "%6.3g\n", num
@}
@end example

@noindent
To illustrate, here is an @command{awk} rule that uses our @code{myprint}
function:

@example
$3 > 0     @{ myprint($3) @}
@end example

@noindent
This program prints, in our special format, all the third fields that
contain a positive number in our input. Therefore, when given the following:

@example
 1.2   3.4    5.6   7.8
 9.10 11.12 -13.14 15.16
17.18 19.20  21.22 23.24
@end example

@noindent
this program, using our function to format the results, prints:

@example
   5.6
  21.2
@end example

This function deletes all the elements in an array:

@example
function delarray(a,    i)
@{
    for (i in a)
       delete a[i]
@}
@end example

When working with arrays, it is often necessary to delete all the elements
in an array and start over with a new list of elements
(@pxref{Delete}).
Instead of having
to repeat this loop everywhere that you need to clear out
an array, your program can just call @code{delarray}.
(This guarantees portability. The use of @samp{delete @var{array}} to delete
the contents of an entire array is a nonstandard extension.)

The following is an example of a recursive function. It takes a string
as an input parameter and returns the string in backwards order.
Recursive functions must always have a test that stops the recursion.
In this case, the recursion terminates when the starting position
is zero, i.e., when there are no more characters left in the string.

@cindex @code{rev} user-defined function
@example
function rev(str, start)
@{
    if (start == 0)
        return ""

    return (substr(str, start, 1) rev(str, start - 1))
@}
@end example

If this function is in a file named @file{rev.awk}, it can be tested
this way:

@example
$ echo "Don't Panic!" |
> gawk --source '@{ print rev($0, length($0)) @}' -f rev.awk
@print{} !cinaP t'noD
@end example

The C @code{ctime} function takes a timestamp and returns it in a string,
formatted in a well-known fashion.
The following example uses the built-in @code{strftime} function
(@pxref{Time Functions})
to create an @command{awk} version of @code{ctime}:

@cindex @code{ctime} user-defined function
@c FIXME: One day, change %d to %e, when C 99 is common.
@example
@c file eg/lib/ctime.awk
# ctime.awk
#
# awk version of C ctime(3) function

function ctime(ts,    format)
@{
    format = "%a %b %d %H:%M:%S %Z %Y"
    if (ts == 0)
        ts = systime()      # use current time as default
    return strftime(format, ts)
@}
@c endfile
@end example
@c ENDOFRANGE fdef

@node Function Caveats
@subsection Calling User-Defined Functions

@c STARTOFRANGE fudc
@cindex functions, user-defined, calling
@dfn{Calling a function} means causing the function to run and do its job.
A function call is an expression and its value is the value returned by
the function.

A function call consists of the function name followed by the arguments
in parentheses. @command{awk} expressions are what you write in the
call for the arguments.
Each time the call is executed, these
expressions are evaluated, and the values are the actual arguments. For
example, here is a call to @code{foo} with three arguments (the first
being a string concatenation):

@example
foo(x y, "lose", 4 * z)
@end example

@strong{Caution:} Whitespace characters (spaces and tabs) are not allowed
between the function name and the open-parenthesis of the argument list.
If you write whitespace by mistake, @command{awk} might think that you mean
to concatenate a variable with an expression in parentheses. However, it
notices that you used a function name and not a variable name, and reports
an error.

@cindex call by value
When a function is called, it is given a @emph{copy} of the values of
its arguments. This is known as @dfn{call by value}. The caller may use
a variable as the expression for the argument, but the called function
does not know this---it only knows what value the argument had. For
example, if you write the following code:

@example
foo = "bar"
z = myfunc(foo)
@end example

@noindent
then you should not think of the argument to @code{myfunc} as being
``the variable @code{foo}.'' Instead, think of the argument as the
string value @code{"bar"}.
If the function @code{myfunc} alters the values of its local variables,
this has no effect on any other variables. Thus, if @code{myfunc}
does this:

@example
function myfunc(str)
@{
   print str
   str = "zzz"
   print str
@}
@end example

@noindent
to change its first argument variable @code{str}, it does @emph{not}
change the value of @code{foo} in the caller. The role of @code{foo} in
calling @code{myfunc} ended when its value (@code{"bar"}) was computed.
If @code{str} also exists outside of @code{myfunc}, the function body
cannot alter this outer value, because it is shadowed during the
execution of @code{myfunc} and cannot be seen or changed from there.

@cindex call by reference
@cindex arrays, as parameters to functions
@cindex functions, arrays as parameters to
However, when arrays are the parameters to functions, they are @emph{not}
copied. Instead, the array itself is made available for direct manipulation
by the function. This is usually called @dfn{call by reference}.
Changes made to an array parameter inside the body of a function @emph{are}
visible outside that function.

@strong{Note:} Changing an array parameter inside a function
can be very dangerous if you do not watch what you are doing.
For example:

@example
function changeit(array, ind, nvalue)
@{
     array[ind] = nvalue
@}

BEGIN @{
    a[1] = 1; a[2] = 2; a[3] = 3
    changeit(a, 2, "two")
    printf "a[1] = %s, a[2] = %s, a[3] = %s\n",
            a[1], a[2], a[3]
@}
@end example

@noindent
prints @samp{a[1] = 1, a[2] = two, a[3] = 3}, because
@code{changeit} stores @code{"two"} in the second element of @code{a}.

@cindex undefined functions
@cindex functions, undefined
Some @command{awk} implementations allow you to call a function that
has not been defined. They only report a problem at runtime when the
program actually tries to call the function. For example:

@example
BEGIN @{
    if (0)
        foo()
    else
        bar()
@}
function bar() @{ @dots{} @}
# note that `foo' is not defined
@end example

@noindent
Because the @samp{if} statement will never be true, it is not really a
problem that @code{foo} has not been defined. Usually, though, it is a
problem if a program calls an undefined function.
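
To recap the argument-passing rules described earlier in this
@value{SUBSECTION}, the following minimal sketch contrasts a scalar
parameter with an array parameter. The function name @code{update} and
the variable names are made up for illustration; they are not part of
@command{gawk}:

@example
# update --- change both a scalar parameter and an array parameter
function update(str, arr)
@{
    str = "changed"         # affects only the function's private copy
    arr["key"] = "changed"  # affects the caller's array
@}

BEGIN @{
    s = "original"
    a["key"] = "original"
    update(s, a)
    print s         # prints "original"
    print a["key"]  # prints "changed"
@}
@end example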

@cindex lint checking, undefined functions
If @option{--lint} is specified
(@pxref{Options}),
@command{gawk} reports calls to undefined functions.

@cindex portability, @code{next} statement in user-defined functions
Some @command{awk} implementations generate a runtime
error if you use the @code{next} statement
(@pxref{Next Statement})
inside a user-defined function.
@command{gawk} does not have this limitation.
@c ENDOFRANGE fudc

@node Return Statement
@subsection The @code{return} Statement
@c comma does NOT start a secondary
@cindex @code{return} statement, user-defined functions

The body of a user-defined function can contain a @code{return} statement.
This statement returns control to the calling part of the @command{awk} program. It
can also be used to return a value for use in the rest of the @command{awk}
program. It looks like this:

@example
return @r{[}@var{expression}@r{]}
@end example

The @var{expression} part is optional. If it is omitted, then the returned
value is undefined, and therefore, unpredictable.

A @code{return} statement with no value expression is assumed at the end of
every function definition. So if control reaches the end of the function
body, then the function returns an unpredictable value. @command{awk}
does @emph{not} warn you if you use the return value of such a function.

Sometimes, you want to write a function for what it does, not for
what it returns. Such a function corresponds to a @code{void} function
in C or to a @code{procedure} in Pascal. Thus, it may be appropriate to not
return any value; simply bear in mind that if you use the return
value of such a function, you do so at your own risk.
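
As a minimal sketch of the difference (the function names @code{warn}
and @code{double} are hypothetical, chosen only for this illustration),
the first function below is called purely for its effect, while the
second is called for the value it returns:

@example
# warn --- called for what it does; its return value is never used
function warn(msg)
@{
    print "warning: " msg
@}

# double --- called for what it returns
function double(n)
@{
    return 2 * n
@}

BEGIN @{
    warn("disk nearly full")    # return value ignored
    print double(21)            # prints 42
@}
@end example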

The following is an example of a user-defined function that returns a value
for the largest number among the elements of an array:

@example
function maxelt(vec,   i, ret)
@{
     for (i in vec) @{
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     @}
     return ret
@}
@end example

@cindex programming conventions, function parameters
@noindent
You call @code{maxelt} with one argument, which is an array name. The local
variables @code{i} and @code{ret} are not intended to be arguments;
while there is nothing to stop you from passing more than one argument
to @code{maxelt}, the results would be strange. The extra space before
@code{i} in the function parameter list indicates that @code{i} and
@code{ret} are not supposed to be arguments.
You should follow this convention when defining functions.

The following program uses the @code{maxelt} function. It loads an
array, calls @code{maxelt}, and then reports the maximum number in that
array:

@example
function maxelt(vec,   i, ret)
@{
     for (i in vec) @{
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     @}
     return ret
@}

# Load all fields of each record into nums.
@{
     for(i = 1; i <= NF; i++)
          nums[NR, i] = $i
@}

END @{
     print maxelt(nums)
@}
@end example

Given the following input:

@example
 1 5 23 8 16
44 3 5 2 8 26
256 291 1396 2962 100
-6 467 998 1101
99385 11 0 225
@end example

@noindent
the program reports (predictably) that @code{99385} is the largest number
in the array.

@node Dynamic Typing
@subsection Functions and Their Effects on Variable Typing

@command{awk} is a very fluid language.
It is possible that @command{awk} can't tell if an identifier
represents a regular variable or an array until runtime.
Here is an annotated sample program:

@example
function foo(a)
@{
    a[1] = 1   # parameter is an array
@}

BEGIN @{
    b = 1
    foo(b)  # invalid: fatal type mismatch

    foo(x)  # x uninitialized, becomes an array dynamically
    x = 1   # now not allowed, runtime error
@}
@end example

Usually, such things aren't a big issue, but it's worth
being aware of them.
@c ENDOFRANGE udfunc
@c ENDOFRANGE funcud

@node Internationalization
@chapter Internationalization with @command{gawk}

Once upon a time, computer makers
wrote software that worked only in English.
Eventually, hardware and software vendors noticed that if their
systems worked in the native languages of non-English-speaking
countries, they were able to sell more systems.
As a result, internationalization and localization
of programs and software systems became a common practice.

@c STARTOFRANGE inloc
@cindex internationalization, localization
@cindex @command{gawk}, internationalization and, See internationalization
@cindex internationalization, localization, @command{gawk} and
Until recently, the ability to provide internationalization
was largely restricted to programs written in C and C++.
This @value{CHAPTER} describes the underlying library @command{gawk}
uses for internationalization, as well as how
@command{gawk} makes internationalization
features available at the @command{awk} program level.
Having internationalization available at the @command{awk} level
gives software developers additional flexibility---they are no
longer required to write in C when internationalization is
a requirement.

@menu
* I18N and L10N::               Internationalization and Localization.
* Explaining gettext::          How GNU @code{gettext} works.
* Programmer i18n::             Features for the programmer.
* Translator i18n::             Features for the translator.
* I18N Example::                A simple i18n example.
* Gawk I18N::                   @command{gawk} is also internationalized.
@end menu

@node I18N and L10N
@section Internationalization and Localization

@cindex internationalization
@c comma is part of see
@cindex localization, See internationalization, localization
@cindex localization
@dfn{Internationalization} means writing (or modifying) a program once,
in such a way that it can use multiple languages without requiring
further source-code changes.
@dfn{Localization} means providing the data necessary for an
internationalized program to work in a particular language.
Most typically, these terms refer to features such as the language
used for printing error messages, the language used to read
responses, and information related to how numerical and
monetary values are printed and read.

@node Explaining gettext
@section GNU @code{gettext}

@cindex internationalizing a program
@c STARTOFRANGE gettex
@cindex @code{gettext} library
The facilities in GNU @code{gettext} focus on messages: strings printed
by a program, either directly or via formatting with @code{printf} or
@code{sprintf}.@footnote{For some operating systems, the @command{gawk}
port doesn't support GNU @code{gettext}. This applies most notably to
the PC operating systems. As such, these features are not available
if you are using one of those operating systems. Sorry.}

@cindex portability, @code{gettext} library and
When using GNU @code{gettext}, each application has its own
@dfn{text domain}.
This is a unique name, such as @samp{kpilot} or @samp{gawk}, 14628that identifies the application. 14629A complete application may have multiple components---programs written 14630in C or C++, as well as scripts written in @command{sh} or @command{awk}. 14631All of the components use the same text domain. 14632 14633To make the discussion concrete, assume we're writing an application 14634named @command{guide}. Internationalization consists of the 14635following steps, in this order: 14636 14637@enumerate 14638@item 14639The programmer goes 14640through the source for all of @command{guide}'s components 14641and marks each string that is a candidate for translation. 14642For example, @code{"`-F': option required"} is a good candidate for translation. 14643A table with strings of option names is not (e.g., @command{gawk}'s 14644@option{--profile} option should remain the same, no matter what the local 14645language). 14646 14647@cindex @code{textdomain} function (C library) 14648@item 14649The programmer indicates the application's text domain 14650(@code{"guide"}) to the @code{gettext} library, 14651by calling the @code{textdomain} function. 14652 14653@item 14654Messages from the application are extracted from the source code and 14655collected into a portable object file (@file{guide.po}), 14656which lists the strings and their translations. 14657The translations are initially empty. 14658The original (usually English) messages serve as the key for 14659lookup of the translations. 14660 14661@cindex @code{.po} files 14662@cindex files, @code{.po} 14663@cindex portable object files 14664@cindex files, portable object 14665@item 14666For each language with a translator, @file{guide.po} 14667is copied and translations are created and shipped with the application. 
14668 14669@cindex @code{.mo} files 14670@cindex files, @code{.mo} 14671@cindex message object files 14672@cindex files, message object 14673@item 14674Each language's @file{.po} file is converted into a binary 14675message object (@file{.mo}) file. 14676A message object file contains the original messages and their 14677translations in a binary format that allows fast lookup of translations 14678at runtime. 14679 14680@item 14681When @command{guide} is built and installed, the binary translation files 14682are installed in a standard place. 14683 14684@cindex @code{bindtextdomain} function (C library) 14685@item 14686For testing and development, it is possible to tell @code{gettext} 14687to use @file{.mo} files in a different directory than the standard 14688one by using the @code{bindtextdomain} function. 14689 14690@cindex @code{.mo} files, specifying directory of 14691@cindex files, @code{.mo}, specifying directory of 14692@cindex message object files, specifying directory of 14693@cindex files, message object, specifying directory of 14694@item 14695At runtime, @command{guide} looks up each string via a call 14696to @code{gettext}. The returned string is the translated string 14697if available, or the original string if not. 14698 14699@item 14700If necessary, it is possible to access messages from a different 14701text domain than the one belonging to the application, without 14702having to switch the application's default text domain back 14703and forth. 14704@end enumerate 14705 14706@cindex @code{gettext} function (C library) 14707In C (or C++), the string marking and dynamic translation lookup 14708are accomplished by wrapping each string in a call to @code{gettext}: 14709 14710@example 14711printf(gettext("Don't Panic!\n")); 14712@end example 14713 14714The tools that extract messages from source code pull out all 14715strings enclosed in calls to @code{gettext}. 
14716 14717@cindex @code{_} (underscore), @code{_} C macro 14718@cindex underscore (@code{_}), @code{_} C macro 14719The GNU @code{gettext} developers, recognizing that typing 14720@samp{gettext} over and over again is both painful and ugly to look 14721at, use the macro @samp{_} (an underscore) to make things easier: 14722 14723@example 14724/* In the standard header file: */ 14725#define _(str) gettext(str) 14726 14727/* In the program text: */ 14728printf(_("Don't Panic!\n")); 14729@end example 14730 14731@cindex internationalization, localization, locale categories 14732@cindex @code{gettext} library, locale categories 14733@cindex locale categories 14734@noindent 14735This reduces the typing overhead to just three extra characters per string 14736and is considerably easier to read as well. 14737There are locale @dfn{categories} 14738for different types of locale-related information. 14739The defined locale categories that @code{gettext} knows about are: 14740 14741@table @code 14742@cindex @code{LC_MESSAGES} locale category 14743@item LC_MESSAGES 14744Text messages. This is the default category for @code{gettext} 14745operations, but it is possible to supply a different one explicitly, 14746if necessary. (It is almost never necessary to supply a different category.) 14747 14748@cindex sorting characters in different languages 14749@cindex @code{LC_COLLATE} locale category 14750@item LC_COLLATE 14751Text-collation information; i.e., how different characters 14752and/or groups of characters sort in a given language. 14753 14754@cindex @code{LC_CTYPE} locale category 14755@item LC_CTYPE 14756Character-type information (alphabetic, digit, upper- or lowercase, and 14757so on). 14758This information is accessed via the 14759POSIX character classes in regular expressions, 14760such as @code{/[[:alnum:]]/} 14761(@pxref{Regexp Operators}). 
14762 14763@cindex monetary information, localization 14764@cindex currency symbols, localization 14765@cindex @code{LC_MONETARY} locale category 14766@item LC_MONETARY 14767Monetary information, such as the currency symbol, and whether the 14768symbol goes before or after a number. 14769 14770@cindex @code{LC_NUMERIC} locale category 14771@item LC_NUMERIC 14772Numeric information, such as which characters to use for the decimal 14773point and the thousands separator.@footnote{Americans 14774use a comma every three decimal places and a period for the decimal 14775point, while many Europeans do exactly the opposite: 14776@code{1,234.56} versus @code{1.234,56}.} 14777 14778@cindex @code{LC_RESPONSE} locale category 14779@item LC_RESPONSE 14780Response information, such as how ``yes'' and ``no'' appear in the 14781local language, and possibly other information as well. 14782 14783@cindex time, localization and 14784@c last comma does NOT start a tertiary 14785@cindex dates, information related to, localization 14786@cindex @code{LC_TIME} locale category 14787@item LC_TIME 14788Time- and date-related information, such as 12- or 24-hour clock, month printed 14789before or after day in a date, local month abbreviations, and so on. 14790 14791@cindex @code{LC_ALL} locale category 14792@item LC_ALL 14793All of the above. (Not too useful in the context of @code{gettext}.) 14794@end table 14795@c ENDOFRANGE gettex 14796 14797@node Programmer i18n 14798@section Internationalizing @command{awk} Programs 14799@c STARTOFRANGE inap 14800@cindex @command{awk} programs, internationalizing 14801 14802@command{gawk} provides the following variables and functions for 14803internationalization: 14804 14805@table @code 14806@cindex @code{TEXTDOMAIN} variable 14807@item TEXTDOMAIN 14808This variable indicates the application's text domain. 14809For compatibility with GNU @code{gettext}, the default 14810value is @code{"messages"}. 
14811 14812@cindex internationalization, localization, marked strings 14813@cindex strings, for localization 14814@item _"your message here" 14815String constants marked with a leading underscore 14816are candidates for translation at runtime. 14817String constants without a leading underscore are not translated. 14818 14819@cindex @code{dcgettext} function (@command{gawk}) 14820@item dcgettext(@var{string} @r{[}, @var{domain} @r{[}, @var{category}@r{]]}) 14821This built-in function returns the translation of @var{string} in 14822text domain @var{domain} for locale category @var{category}. 14823The default value for @var{domain} is the current value of @code{TEXTDOMAIN}. 14824The default value for @var{category} is @code{"LC_MESSAGES"}. 14825 14826If you supply a value for @var{category}, it must be a string equal to 14827one of the known locale categories described in 14828@ifnotinfo 14829the previous @value{SECTION}. 14830@end ifnotinfo 14831@ifinfo 14832@ref{Explaining gettext}. 14833@end ifinfo 14834You must also supply a text domain. Use @code{TEXTDOMAIN} if 14835you want to use the current domain. 14836 14837@strong{Caution:} The order of arguments to the @command{awk} version 14838of the @code{dcgettext} function is purposely different from the order for 14839the C version. The @command{awk} version's order was 14840chosen to be simple and to allow for reasonable @command{awk}-style 14841default arguments. 14842 14843@cindex @code{dcngettext} function (@command{gawk}) 14844@item dcngettext(@var{string1}, @var{string2}, @var{number} @r{[}, @var{domain} @r{[}, @var{category}@r{]]}) 14845This built-in function returns the plural form used for @var{number} of the 14846translation of @var{string1} and @var{string2} in text domain 14847@var{domain} for locale category @var{category}. @var{string1} is the 14848English singular variant of a message, and @var{string2} the English plural 14849variant of the same message. 
14850The default value for @var{domain} is the current value of @code{TEXTDOMAIN}. 14851The default value for @var{category} is @code{"LC_MESSAGES"}. 14852 14853The same remarks as for the @code{dcgettext} function apply. 14854 14855@cindex @code{.mo} files, specifying directory of 14856@cindex files, @code{.mo}, specifying directory of 14857@cindex message object files, specifying directory of 14858@cindex files, message object, specifying directory of 14859@cindex @code{bindtextdomain} function (@command{gawk}) 14860@item bindtextdomain(@var{directory} @r{[}, @var{domain}@r{]}) 14861This built-in function allows you to specify the directory in which 14862@code{gettext} looks for @file{.mo} files, in case they 14863will not or cannot be placed in the standard locations 14864(e.g., during testing). 14865It returns the directory in which @var{domain} is ``bound.'' 14866 14867The default @var{domain} is the value of @code{TEXTDOMAIN}. 14868If @var{directory} is the null string (@code{""}), then 14869@code{bindtextdomain} returns the current binding for the 14870given @var{domain}. 14871@end table 14872 14873To use these facilities in your @command{awk} program, follow the steps 14874outlined in 14875@ifnotinfo 14876the previous @value{SECTION}, 14877@end ifnotinfo 14878@ifinfo 14879@ref{Explaining gettext}, 14880@end ifinfo 14881like so: 14882 14883@enumerate 14884@cindex @code{BEGIN} pattern, @code{TEXTDOMAIN} variable and 14885@cindex @code{TEXTDOMAIN} variable, @code{BEGIN} pattern and 14886@item 14887Set the variable @code{TEXTDOMAIN} to the text domain of 14888your program. 
This is best done in a @code{BEGIN} rule 14889(@pxref{BEGIN/END}), 14890or it can also be done via the @option{-v} command-line 14891option (@pxref{Options}): 14892 14893@example 14894BEGIN @{ 14895 TEXTDOMAIN = "guide" 14896 @dots{} 14897@} 14898@end example 14899 14900@cindex @code{_} (underscore), translatable string 14901@cindex underscore (@code{_}), translatable string 14902@item 14903Mark all translatable strings with a leading underscore (@samp{_}) 14904character. It @emph{must} be adjacent to the opening 14905quote of the string. For example: 14906 14907@example 14908print _"hello, world" 14909x = _"you goofed" 14910printf(_"Number of users is %d\n", nusers) 14911@end example 14912 14913@item 14914If you are creating strings dynamically, you can 14915still translate them, using the @code{dcgettext} 14916built-in function: 14917 14918@example 14919message = nusers " users logged in" 14920message = dcgettext(message, "adminprog") 14921print message 14922@end example 14923 14924Here, the call to @code{dcgettext} supplies a different 14925text domain (@code{"adminprog"}) in which to find the 14926message, but it uses the default @code{"LC_MESSAGES"} category. 14927 14928@cindex @code{LC_MESSAGES} locale category, @code{bindtextdomain} function (@command{gawk}) 14929@item 14930During development, you might want to put the @file{.mo} 14931file in a private directory for testing. This is done 14932with the @code{bindtextdomain} built-in function: 14933 14934@example 14935BEGIN @{ 14936 TEXTDOMAIN = "guide" # our text domain 14937 if (Testing) @{ 14938 # where to find our files 14939 bindtextdomain("testdir") 14940 # joe is in charge of adminprog 14941 bindtextdomain("../joe/testdir", "adminprog") 14942 @} 14943 @dots{} 14944@} 14945@end example 14946 14947@end enumerate 14948 14949@xref{I18N Example}, 14950for an example program showing the steps to create 14951and use translations from @command{awk}. 
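
The @code{dcngettext} function described earlier in this @value{SECTION}
has no example of its own, so here is a minimal sketch of how it might be
used to choose between the singular and plural variants of a message.
The message strings and the variable @code{nusers} are made up for
illustration; they are not part of @command{guide}:

@example
BEGIN @{
    TEXTDOMAIN = "guide"
    nusers = 3    # hypothetical count, for illustration only

    # choose the singular or plural format that fits nusers
    format = dcngettext("%d user logged in\n",
                        "%d users logged in\n", nusers)
    printf(format, nusers)
@}
@end example

@noindent
If no translation for the @code{"guide"} text domain is installed,
this should simply fall back to the English strings, using the first
one when @code{nusers} is 1 and the second one otherwise.
Passing the count both to @code{dcngettext} and to @code{printf} is
deliberate: the first call only selects the format string, and the
second fills in the number.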
14952 14953@node Translator i18n 14954@section Translating @command{awk} Programs 14955 14956@cindex @code{.po} files 14957@cindex files, @code{.po} 14958@cindex portable object files 14959@cindex files, portable object 14960Once a program's translatable strings have been marked, they must 14961be extracted to create the initial @file{.po} file. 14962As part of translation, it is often helpful to rearrange the order 14963in which arguments to @code{printf} are output. 14964 14965@command{gawk}'s @option{--gen-po} command-line option extracts 14966the messages and is discussed next. 14967After that, @code{printf}'s ability to 14968rearrange the order for @code{printf} arguments at runtime 14969is covered. 14970 14971@menu 14972* String Extraction:: Extracting marked strings. 14973* Printf Ordering:: Rearranging @code{printf} arguments. 14974* I18N Portability:: @command{awk}-level portability issues. 14975@end menu 14976 14977@node String Extraction 14978@subsection Extracting Marked Strings 14979@cindex strings, extracting 14980@c comma does NOT start secondary 14981@cindex marked strings, extracting 14982@cindex @code{--gen-po} option 14983@cindex command-line options, string extraction 14984@cindex string extraction (internationalization) 14985@cindex marked string extraction (internationalization) 14986@cindex extraction, of marked strings (internationalization) 14987 14988@cindex @code{--gen-po} option 14989Once your @command{awk} program is working, and all the strings have 14990been marked and you've set (and perhaps bound) the text domain, 14991it is time to produce translations. 14992First, use the @option{--gen-po} command-line option to create 14993the initial @file{.po} file: 14994 14995@example 14996$ gawk --gen-po -f guide.awk > guide.po 14997@end example 14998 14999@cindex @code{xgettext} utility 15000When run with @option{--gen-po}, @command{gawk} does not execute your 15001program. 
Instead, it parses the program as usual and prints all marked strings
to standard output in the format of a GNU @code{gettext} Portable Object
file.  Also included in the output are any constant strings that
appear as the first argument to @code{dcgettext} or as the first and
second arguments to @code{dcngettext}.@footnote{Starting with @code{gettext}
version 0.11.5, the @command{xgettext} utility that comes with GNU
@code{gettext} can handle @file{.awk} files.}
@xref{I18N Example},
for the full list of steps to go through to create and test
translations for @command{guide}.

@node Printf Ordering
@subsection Rearranging @code{printf} Arguments

@cindex @code{printf} statement, positional specifiers
@c comma does NOT start secondary
@cindex positional specifiers, @code{printf} statement
Format strings for @code{printf} and @code{sprintf}
(@pxref{Printf})
present a special problem for translation.
Consider the following:@footnote{This example is borrowed
from the GNU @code{gettext} manual.}

@c line broken here only for smallbook format
@example
printf(_"String `%s' has %d characters\n",
          string, length(string))
@end example

A possible German translation for this might be:

@example
"%d Zeichen lang ist die Zeichenkette `%s'\n"
@end example

The problem should be obvious: the order of the format
specifications is different from the original!
Even though @code{gettext} can return the translated string
at runtime,
it cannot change the argument order in the call to @code{printf}.

To solve this problem, @code{printf} format specifiers may have
an additional optional element, which we call a @dfn{positional specifier}.
For example:

@example
"%2$d Zeichen lang ist die Zeichenkette `%1$s'\n"
@end example

Here, the positional specifier consists of an integer count, which indicates which
argument to use, and a @samp{$}.  Counts are one-based, and the
format string itself is @emph{not} included.  Thus, in the following
example, @samp{string} is the first argument and @samp{length(string)} is the second:

@example
$ gawk 'BEGIN @{
>     string = "Dont Panic"
>     printf _"%2$d characters live in \"%1$s\"\n",
>                         string, length(string)
> @}'
@print{} 10 characters live in "Dont Panic"
@end example

If present, positional specifiers come first in the format specification,
before the flags, the field width, and/or the precision.

Positional specifiers can be used with the dynamic field width and
precision capability:

@example
$ gawk 'BEGIN @{
>    printf("%*.*s\n", 10, 20, "hello")
>    printf("%3$*2$.*1$s\n", 20, 10, "hello")
> @}'
@print{}      hello
@print{}      hello
@end example

@noindent
@strong{Note:} When using @samp{*} with a positional specifier, the @samp{*}
comes first, then the integer position, and then the @samp{$}.
This is somewhat counterintuitive.

@cindex @code{printf} statement, positional specifiers, mixing with regular formats
@c first comma is part of primary
@cindex positional specifiers, @code{printf} statement, mixing with regular formats
@cindex format specifiers, mixing regular with positional specifiers
@command{gawk} does not allow you to mix regular format specifiers
and those with positional specifiers in the same string:

@smallexample
$ gawk 'BEGIN @{ printf _"%d %3$s\n", 1, 2, "hi" @}'
@error{} gawk: cmd. line:1: fatal: must use `count$' on all formats or none
@end smallexample

@strong{Note:} There are some pathological cases that @command{gawk} may fail to
diagnose.  In such cases, the output may not be what you expect.
It's still a bad idea to try mixing them, even if @command{gawk}
doesn't detect it.

Although positional specifiers can be used directly in @command{awk} programs,
their primary purpose is to help in producing correct translations of
format strings into languages different from the one in which the program
is first written.

@node I18N Portability
@subsection @command{awk} Portability Issues

@cindex portability, internationalization and
@cindex internationalization, localization, portability and
@command{gawk}'s internationalization features were purposely chosen to
have as little impact as possible on the portability of @command{awk}
programs that use them to other versions of @command{awk}.
Consider this program:

@example
BEGIN @{
    TEXTDOMAIN = "guide"
    if (Test_Guide)   # set with -v
        bindtextdomain("/test/guide/messages")
    print _"don't panic!"
@}
@end example

@noindent
As written, it won't work on other versions of @command{awk}.
However, it is actually almost portable, requiring very little
change:

@itemize @bullet
@cindex @code{TEXTDOMAIN} variable, portability and
@item
Assignments to @code{TEXTDOMAIN} won't have any effect,
since @code{TEXTDOMAIN} is not special in other @command{awk} implementations.
15135 15136@item 15137Non-GNU versions of @command{awk} treat marked strings 15138as the concatenation of a variable named @code{_} with the string 15139following it.@footnote{This is good fodder for an ``Obfuscated 15140@command{awk}'' contest.} Typically, the variable @code{_} has 15141the null string (@code{""}) as its value, leaving the original string constant as 15142the result. 15143 15144@item 15145By defining ``dummy'' functions to replace @code{dcgettext}, @code{dcngettext} 15146and @code{bindtextdomain}, the @command{awk} program can be made to run, but 15147all the messages are output in the original language. 15148For example: 15149 15150@cindex @code{bindtextdomain} function (@command{gawk}), portability and 15151@cindex @code{dcgettext} function (@command{gawk}), portability and 15152@cindex @code{dcngettext} function (@command{gawk}), portability and 15153@example 15154@c file eg/lib/libintl.awk 15155function bindtextdomain(dir, domain) 15156@{ 15157 return dir 15158@} 15159 15160function dcgettext(string, domain, category) 15161@{ 15162 return string 15163@} 15164 15165function dcngettext(string1, string2, number, domain, category) 15166@{ 15167 return (number == 1 ? string1 : string2) 15168@} 15169@c endfile 15170@end example 15171 15172@item 15173The use of positional specifications in @code{printf} or 15174@code{sprintf} is @emph{not} portable. 15175To support @code{gettext} at the C level, many systems' C versions of 15176@code{sprintf} do support positional specifiers. But it works only if 15177enough arguments are supplied in the function call. Many versions of 15178@command{awk} pass @code{printf} formats and arguments unchanged to the 15179underlying C library version of @code{sprintf}, but only one format and 15180argument at a time. What happens if a positional specification is 15181used is anybody's guess. 
15182However, since the positional specifications are primarily for use in 15183@emph{translated} format strings, and since non-GNU @command{awk}s never 15184retrieve the translated string, this should not be a problem in practice. 15185@end itemize 15186@c ENDOFRANGE inap 15187 15188@node I18N Example 15189@section A Simple Internationalization Example 15190 15191Now let's look at a step-by-step example of how to internationalize and 15192localize a simple @command{awk} program, using @file{guide.awk} as our 15193original source: 15194 15195@example 15196@c file eg/prog/guide.awk 15197BEGIN @{ 15198 TEXTDOMAIN = "guide" 15199 bindtextdomain(".") # for testing 15200 print _"Don't Panic" 15201 print _"The Answer Is", 42 15202 print "Pardon me, Zaphod who?" 15203@} 15204@c endfile 15205@end example 15206 15207@noindent 15208Run @samp{gawk --gen-po} to create the @file{.po} file: 15209 15210@example 15211$ gawk --gen-po -f guide.awk > guide.po 15212@end example 15213 15214@noindent 15215This produces: 15216 15217@example 15218@c file eg/data/guide.po 15219#: guide.awk:4 15220msgid "Don't Panic" 15221msgstr "" 15222 15223#: guide.awk:5 15224msgid "The Answer Is" 15225msgstr "" 15226 15227@c endfile 15228@end example 15229 15230This original portable object file is saved and reused for each language 15231into which the application is translated. The @code{msgid} 15232is the original string and the @code{msgstr} is the translation. 15233 15234@strong{Note:} Strings not marked with a leading underscore do not 15235appear in the @file{guide.po} file. 15236 15237Next, the messages must be translated. 
15238Here is a translation to a hypothetical dialect of English, 15239called ``Mellow'':@footnote{Perhaps it would be better if it were 15240called ``Hippy.'' Ah, well.} 15241 15242@example 15243@group 15244$ cp guide.po guide-mellow.po 15245@var{Add translations to} guide-mellow.po @dots{} 15246@end group 15247@end example 15248 15249@noindent 15250Following are the translations: 15251 15252@example 15253@c file eg/data/guide-mellow.po 15254#: guide.awk:4 15255msgid "Don't Panic" 15256msgstr "Hey man, relax!" 15257 15258#: guide.awk:5 15259msgid "The Answer Is" 15260msgstr "Like, the scoop is" 15261 15262@c endfile 15263@end example 15264 15265@cindex Linux 15266@cindex GNU/Linux 15267The next step is to make the directory to hold the binary message object 15268file and then to create the @file{guide.mo} file. 15269The directory layout shown here is standard for GNU @code{gettext} on 15270GNU/Linux systems. Other versions of @code{gettext} may use a different 15271layout: 15272 15273@example 15274$ mkdir en_US en_US/LC_MESSAGES 15275@end example 15276 15277@cindex @code{.po} files, converting to @code{.mo} 15278@cindex files, @code{.po}, converting to @code{.mo} 15279@cindex @code{.mo} files, converting from @code{.po} 15280@cindex files, @code{.mo}, converting from @code{.po} 15281@cindex portable object files, converting to message object files 15282@cindex files, portable object, converting to message object files 15283@cindex message object files, converting from portable object files 15284@cindex files, message object, converting from portable object files 15285@cindex @command{msgfmt} utility 15286The @command{msgfmt} utility does the conversion from human-readable 15287@file{.po} file to machine-readable @file{.mo} file. 15288By default, @command{msgfmt} creates a file named @file{messages}. 
15289This file must be renamed and placed in the proper directory so that 15290@command{gawk} can find it: 15291 15292@example 15293$ msgfmt guide-mellow.po 15294$ mv messages en_US/LC_MESSAGES/guide.mo 15295@end example 15296 15297Finally, we run the program to test it: 15298 15299@example 15300$ gawk -f guide.awk 15301@print{} Hey man, relax! 15302@print{} Like, the scoop is 42 15303@print{} Pardon me, Zaphod who? 15304@end example 15305 15306If the three replacement functions for @code{dcgettext}, @code{dcngettext} 15307and @code{bindtextdomain} 15308(@pxref{I18N Portability}) 15309are in a file named @file{libintl.awk}, 15310then we can run @file{guide.awk} unchanged as follows: 15311 15312@example 15313$ gawk --posix -f guide.awk -f libintl.awk 15314@print{} Don't Panic 15315@print{} The Answer Is 42 15316@print{} Pardon me, Zaphod who? 15317@end example 15318 15319@node Gawk I18N 15320@section @command{gawk} Can Speak Your Language 15321 15322As of @value{PVERSION} 3.1, @command{gawk} itself has been internationalized 15323using the GNU @code{gettext} package. 15324@ifinfo 15325(GNU @code{gettext} is described in 15326complete detail in 15327@ref{Top}.) 15328@end ifinfo 15329@ifnotinfo 15330(GNU @code{gettext} is described in 15331complete detail in 15332@cite{GNU gettext tools}.) 15333@end ifnotinfo 15334As of this writing, the latest version of GNU @code{gettext} is 15335@uref{ftp://ftp.gnu.org/gnu/gettext/gettext-0.11.5.tar.gz, @value{PVERSION} 0.11.5}. 15336 15337If a translation of @command{gawk}'s messages exists, 15338then @command{gawk} produces usage messages, warnings, 15339and fatal errors in the local language. 
15340 15341@cindex @code{--with-included-gettext} configuration option 15342@cindex configuration option, @code{--with-included-gettext} 15343On systems that do not use @value{PVERSION} 2 (or later) of the GNU C library, you should 15344configure @command{gawk} with the @option{--with-included-gettext} option 15345before compiling and installing it. 15346@xref{Additional Configuration Options}, 15347for more information. 15348@c ENDOFRANGE inloc 15349 15350@node Advanced Features 15351@chapter Advanced Features of @command{gawk} 15352@cindex advanced features, network connections, See Also networks, connections 15353@c STARTOFRANGE gawadv 15354@cindex @command{gawk}, features, advanced 15355@c STARTOFRANGE advgaw 15356@cindex advanced features, @command{gawk} 15357@ignore 15358Contributed by: Peter Langston <pud!psl@bellcore.bellcore.com> 15359 15360 Found in Steve English's "signature" line: 15361 15362"Write documentation as if whoever reads it is a violent psychopath 15363who knows where you live." 15364@end ignore 15365@quotation 15366@i{Write documentation as if whoever reads it is 15367a violent psychopath who knows where you live.}@* 15368Steve English, as quoted by Peter Langston 15369@end quotation 15370 15371This @value{CHAPTER} discusses advanced features in @command{gawk}. 15372It's a bit of a ``grab bag'' of items that are otherwise unrelated 15373to each other. 15374First, a command-line option allows @command{gawk} to recognize 15375nondecimal numbers in input data, not just in @command{awk} 15376programs. Next, two-way I/O, discussed briefly in earlier parts of this 15377@value{DOCUMENT}, is described in full detail, along with the basics 15378of TCP/IP networking and BSD portal files. Finally, @command{gawk} 15379can @dfn{profile} an @command{awk} program, making it possible to tune 15380it for performance. 15381 15382@ref{Dynamic Extensions}, 15383discusses the ability to dynamically add new built-in functions to 15384@command{gawk}. 
As this feature is still immature and likely to change, 15385its description is relegated to an appendix. 15386 15387@menu 15388* Nondecimal Data:: Allowing nondecimal input data. 15389* Two-way I/O:: Two-way communications with another process. 15390* TCP/IP Networking:: Using @command{gawk} for network programming. 15391* Portal Files:: Using @command{gawk} with BSD portals. 15392* Profiling:: Profiling your @command{awk} programs. 15393@end menu 15394 15395@node Nondecimal Data 15396@section Allowing Nondecimal Input Data 15397@cindex @code{--non-decimal-data} option 15398@cindex advanced features, @command{gawk}, nondecimal input data 15399@c last comma does NOT start tertiary 15400@cindex input, data, nondecimal 15401@cindex constants, nondecimal 15402 15403If you run @command{gawk} with the @option{--non-decimal-data} option, 15404you can have nondecimal constants in your input data: 15405 15406@c line break here for small book format 15407@example 15408$ echo 0123 123 0x123 | 15409> gawk --non-decimal-data '@{ printf "%d, %d, %d\n", 15410> $1, $2, $3 @}' 15411@print{} 83, 123, 291 15412@end example 15413 15414For this feature to work, write your program so that 15415@command{gawk} treats your data as numeric: 15416 15417@example 15418$ echo 0123 123 0x123 | gawk '@{ print $1, $2, $3 @}' 15419@print{} 0123 123 0x123 15420@end example 15421 15422@noindent 15423The @code{print} statement treats its expressions as strings. 15424Although the fields can act as numbers when necessary, 15425they are still strings, so @code{print} does not try to treat them 15426numerically. You may need to add zero to a field to force it to 15427be treated as a number. 
For example: 15428 15429@example 15430$ echo 0123 123 0x123 | gawk --non-decimal-data ' 15431> @{ print $1, $2, $3 15432> print $1 + 0, $2 + 0, $3 + 0 @}' 15433@print{} 0123 123 0x123 15434@print{} 83 123 291 15435@end example 15436 15437Because it is common to have decimal data with leading zeros, and because 15438using it could lead to surprising results, the default is to leave this 15439facility disabled. If you want it, you must explicitly request it. 15440 15441@cindex programming conventions, @code{--non-decimal-data} option 15442@cindex @code{--non-decimal-data} option, @code{strtonum} function and 15443@cindex @code{strtonum} function (@command{gawk}), @code{--non-decimal-data} option and 15444@strong{Caution:} 15445@emph{Use of this option is not recommended.} 15446It can break old programs very badly. 15447Instead, use the @code{strtonum} function to convert your data 15448(@pxref{Nondecimal-numbers}). 15449This makes your programs easier to write and easier to read, and 15450leads to less surprising results. 15451 15452@node Two-way I/O 15453@section Two-Way Communications with Another Process 15454@cindex Brennan, Michael 15455@cindex programmers, attractiveness of 15456@smallexample 15457@c Path: cssun.mathcs.emory.edu!gatech!newsxfer3.itd.umich.edu!news-peer.sprintlink.net!news-sea-19.sprintlink.net!news-in-west.sprintlink.net!news.sprintlink.net!Sprint!204.94.52.5!news.whidbey.com!brennan 15458From: brennan@@whidbey.com (Mike Brennan) 15459Newsgroups: comp.lang.awk 15460Subject: Re: Learn the SECRET to Attract Women Easily 15461Date: 4 Aug 1997 17:34:46 GMT 15462@c Organization: WhidbeyNet 15463@c Lines: 12 15464Message-ID: <5s53rm$eca@@news.whidbey.com> 15465@c References: <5s20dn$2e1@chronicle.concentric.net> 15466@c Reply-To: brennan@whidbey.com 15467@c NNTP-Posting-Host: asn202.whidbey.com 15468@c X-Newsreader: slrn (0.9.4.1 UNIX) 15469@c Xref: cssun.mathcs.emory.edu comp.lang.awk:5403 15470 15471On 3 Aug 1997 13:17:43 GMT, Want More Dates??? 
<tracy78@@kilgrona.com> wrote:
>Learn the SECRET to Attract Women Easily
>
>The SCENT(tm) Pheromone Sex Attractant For Men to Attract Women

The scent of awk programmers is a lot more attractive to women than
the scent of perl programmers.
--
Mike Brennan
@c brennan@@whidbey.com
@end smallexample

@c final comma is part of tertiary
@cindex advanced features, @command{gawk}, processes, communicating with
@cindex processes, two-way communications with
It is often useful to be able to
send data to a separate program for
processing and then read the result.  This can always be
done with temporary files:

@example
# write the data for processing
tempfile = ("mydata." PROCINFO["pid"])
while (@var{not done with data})
    print @var{data} | ("subprogram > " tempfile)
close("subprogram > " tempfile)

# read the results, remove tempfile when done
while ((getline newdata < tempfile) > 0)
    @var{process} newdata @var{appropriately}
close(tempfile)
system("rm " tempfile)
@end example

@noindent
This works, but not elegantly.  Among other things, it requires that
the program be run in a directory that cannot be shared among users;
for example, @file{/tmp} will not do, as another user might happen
to be using a temporary file with the same name.

@cindex coprocesses
@cindex input/output, two-way
@cindex @code{|} (vertical bar), @code{|&} operator (I/O)
@cindex vertical bar (@code{|}), @code{|&} I/O operator (I/O)
@cindex @command{csh} utility, @code{|&} operator, comparison with
Starting with @value{PVERSION} 3.1 of @command{gawk}, it is possible to
open a @emph{two-way} pipe to another process.  The second process is
termed a @dfn{coprocess}, since it runs in parallel with @command{gawk}.
The two-way connection is created using the new @samp{|&} operator
(borrowed from the Korn shell, @command{ksh}):@footnote{This is very
different from the same operator in the C shell, @command{csh}.}

@example
do @{
    print @var{data} |& "subprogram"
    "subprogram" |& getline results
@} while (@var{data left to process})
close("subprogram")
@end example

The first time an I/O operation is executed using the @samp{|&}
operator, @command{gawk} creates a two-way pipeline to a child process
that runs the other program.  Output created with @code{print}
or @code{printf} is written to the program's standard input, and
output from the program's standard output can be read by the @command{gawk}
program using @code{getline}.
As is the case with processes started by @samp{|}, the subprogram
can be any program, or pipeline of programs, that can be started by
the shell.

There are some cautionary items to be aware of:

@itemize @bullet
@item
As the code inside @command{gawk} currently stands, the coprocess's
standard error goes to the same place that the parent @command{gawk}'s
standard error goes.  It is not possible to read the child's
standard error separately.

@cindex deadlocks
@cindex buffering, input/output
@cindex @code{getline} command, deadlock and
@item
I/O buffering may be a problem.  @command{gawk} automatically
flushes all output down the pipe to the child process.
However, if the coprocess does not flush its output,
@command{gawk} may hang when doing a @code{getline} in order to read
the coprocess's results.  This could lead to a situation
known as @dfn{deadlock}, where each process is waiting for the
other one to do something.
@end itemize

@cindex @code{close} function, two-way pipes and
It is possible to close just one end of the two-way pipe to
a coprocess, by supplying a second argument to the @code{close}
function of either @code{"to"} or @code{"from"}
(@pxref{Close Files And Pipes}).
These strings tell @command{gawk} to close the end of the pipe
that sends data to the process or the end that reads from it,
respectively.

@cindex @command{sort} utility, coprocesses and
This is particularly necessary in order to use
the system @command{sort} utility as part of a coprocess;
@command{sort} must read @emph{all} of its input
data before it can produce any output.
The @command{sort} program does not receive an end-of-file indication
until @command{gawk} closes the write end of the pipe.

When you have finished writing data to the @command{sort}
utility, you can close the @code{"to"} end of the pipe, and
then start reading sorted data via @code{getline}.
For example:

@example
BEGIN @{
    command = "LC_ALL=C sort"
    n = split("abcdefghijklmnopqrstuvwxyz", a, "")

    for (i = n; i > 0; i--)
        print a[i] |& command
    close(command, "to")

    while ((command |& getline line) > 0)
        print "got", line
    close(command)
@}
@end example

This program writes the letters of the alphabet in reverse order, one
per line, down the two-way pipe to @command{sort}.  It then closes the
write end of the pipe, so that @command{sort} receives an end-of-file
indication.  This causes @command{sort} to sort the data and write the
sorted data back to the @command{gawk} program.  Once all of the data
has been read, @command{gawk} terminates the coprocess and exits.
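
The same sequence of writing, closing the @code{"to"} end, and then
reading can be wrapped in a function.  The following is only a sketch:
@code{sort_via_coprocess} is a hypothetical helper, not something
supplied by @command{gawk}, and it assumes that no array value contains
a newline:

@example
# sort_via_coprocess --- hypothetical helper, for illustration only.
# Sorts arr[1] through arr[n] in place and returns the element count.
function sort_via_coprocess(arr, n,    command, i, line)
@{
    if (n < 1)        # nothing would be written, so do not start
        return 0      # a coprocess that would wait for input forever

    command = "LC_ALL=C sort"
    for (i = 1; i <= n; i++)
        print arr[i] |& command
    close(command, "to")        # sort now sees end-of-file

    i = 0
    while ((command |& getline line) > 0)
        arr[++i] = line
    close(command)
    return i
@}

BEGIN @{
    n = split("pear apple fig", fruit, " ")
    n = sort_via_coprocess(fruit, n)
    for (i = 1; i <= n; i++)
        print fruit[i]
@}
@end example

@noindent
The early return is deliberate.  With an empty array nothing is ever
written, so the first @code{getline} would start a fresh coprocess whose
input end is still open, and @command{sort} would wait for data that
never arrives.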

As a side note, the assignment @samp{LC_ALL=C} in the @command{sort}
command ensures traditional Unix (ASCII) sorting from @command{sort}.

Beginning with @command{gawk} 3.1.2, you may use pseudo-ttys (ptys) for
two-way communication instead of pipes, if your system supports them.
This is done on a per-command basis, by setting a special element
in the @code{PROCINFO} array
(@pxref{Auto-set}),
like so:

@example
command = "sort -nr"           # command, saved in variable for convenience
PROCINFO[command, "pty"] = 1   # update PROCINFO
print @dots{} |& command       # start two-way pipe
@dots{}
@end example

@noindent
Using ptys avoids the buffer deadlock issues described earlier, at some
loss in performance.  If your system does not have ptys, or if all the
system's ptys are in use, @command{gawk} automatically falls back to
using regular pipes.

@node TCP/IP Networking
@section Using @command{gawk} for Network Programming
@cindex advanced features, @command{gawk}, network programming
@cindex networks, programming
@c STARTOFRANGE tcpip
@cindex TCP/IP
@cindex @code{/inet/} files (@command{gawk})
@cindex files, @code{/inet/} (@command{gawk})
@cindex @code{EMISTERED}
@quotation
@code{EMISTERED}: @i{A host is a host from coast to coast,@*
and no-one can talk to host that's close,@*
unless the host that isn't close@*
is busy hung or dead.}
@end quotation

In addition to being able to open a two-way pipeline to a coprocess
on the same system
(@pxref{Two-way I/O}),
it is possible to make a two-way connection to
another process on another system across an IP networking connection.

You can think of this as just a @emph{very long} two-way pipeline to
a coprocess.
The way @command{gawk} decides that you want to use TCP/IP networking is
by recognizing special @value{FN}s that begin with @samp{/inet/}.

The full syntax of the special @value{FN} is
@file{/inet/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}}.
The components are:

@table @var
@item protocol
The protocol to use over IP.  This must be one of @samp{tcp},
@samp{udp}, or @samp{raw}, for a TCP, UDP, or raw IP connection,
respectively.  The use of TCP is recommended for most applications.

@cindex raw sockets
@cindex sockets
@strong{Caution:} The use of raw sockets is not currently supported
in @value{PVERSION} 3.1 of @command{gawk}.

@item local-port
@cindex @code{getservbyname} function (C library)
The local TCP or UDP port number to use.  Use a port number of @samp{0}
when you want the system to pick a port.  This is what you should do
when writing a TCP or UDP client.
You may also use a well-known service name, such as @samp{smtp}
or @samp{http}, in which case @command{gawk} attempts to determine
the predefined port number using the C @code{getservbyname} function.

@item remote-host
The IP address or fully-qualified domain name of the Internet
host to which you want to connect.

@item remote-port
The TCP or UDP port number to use on the given @var{remote-host}.
Again, use @samp{0} if you don't care, or else a well-known
service name.
@end table

Consider the following very simple example:

@example
BEGIN @{
  Service = "/inet/tcp/0/localhost/daytime"
  Service |& getline
  print $0
  close(Service)
@}
@end example

This program reads the current date and time from the local system's
TCP @samp{daytime} server.
It then prints the results and closes the connection.
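
The four components of the special @value{FN} can also be assembled at
runtime, and the return value of @code{getline} can be checked before
the data is used.  The following sketch assumes nothing beyond the
syntax just described; @code{inet_name} is a hypothetical helper, not a
@command{gawk} built-in:

@example
# inet_name --- hypothetical helper that builds an /inet/ file name
function inet_name(protocol, local_port, remote_host, remote_port)
@{
    return "/inet/" protocol "/" local_port "/" remote_host "/" remote_port
@}

BEGIN @{
    Service = inet_name("tcp", 0, "localhost", "daytime")
    if ((Service |& getline line) > 0)
        print line
    else
        print "no reply from " Service > "/dev/stderr"
    close(Service)
@}
@end example

@noindent
The @code{else} branch covers the case where the connection opens but no
data arrives.  Depending on the version of @command{gawk}, a failure to
open the connection at all may be reported as a fatal error instead, in
which case the branch is never reached.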

Because this topic is extensive, the use of @command{gawk} for
TCP/IP programming is documented separately.
@ifinfo
@xref{Top},
@end ifinfo
@ifnotinfo
See @cite{TCP/IP Internetworking with @command{gawk}},
which comes as part of the @command{gawk} distribution,
@end ifnotinfo
for a much more complete introduction and discussion, as well as
extensive examples.

@node Portal Files
@section Using @command{gawk} with BSD Portals
@cindex advanced features, @command{gawk}, BSD portals
@cindex portal files
@cindex files, portal
@cindex BSD portals
@cindex @code{/p} files (@command{gawk})
@cindex files, @code{/p} (@command{gawk})
@cindex @code{--enable-portals} configuration option
@cindex operating systems, BSD-based

Similar to the @file{/inet} special files, if @command{gawk}
is configured with the @option{--enable-portals} option
(@pxref{Quick Installation}),
then @command{gawk} treats
files whose pathnames begin with @code{/p} as 4.4 BSD-style portals.

@cindex @code{|} (vertical bar), @code{|&} operator (I/O), two-way communications
@cindex vertical bar (@code{|}), @code{|&} operator (I/O), two-way communications
When used with the @samp{|&} operator, @command{gawk} opens the file
for two-way communications.  The operating system's portal mechanism
then manages creating the process associated with the portal and
the corresponding communications with the portal's process.
@c ENDOFRANGE tcpip

@node Profiling
@section Profiling Your @command{awk} Programs
@c STARTOFRANGE awkp
@cindex @command{awk} programs, profiling
@c STARTOFRANGE proawk
@cindex profiling @command{awk} programs
@c STARTOFRANGE pgawk
@cindex @command{pgawk} program
@cindex profiling @command{gawk}, See @command{pgawk} program

Beginning with @value{PVERSION} 3.1 of @command{gawk}, you may produce execution
traces of your @command{awk} programs.
This is done with a specially compiled version of @command{gawk},
called @command{pgawk} (``profiling @command{gawk}'').

@cindex @code{awkprof.out} file
@cindex files, @code{awkprof.out}
@cindex @command{pgawk} program, @code{awkprof.out} file
@command{pgawk} is identical in every way to @command{gawk}, except that when
it has finished running, it creates a profile of your program in a file
named @file{awkprof.out}.
Because it is profiling, it also executes up to 45% slower than
@command{gawk} normally does.

@cindex @code{--profile} option
As shown in the following example,
the @option{--profile} option can be used to change the name of the file
where @command{pgawk} will write the profile:

@example
$ pgawk --profile=myprog.prof -f myprog.awk data1 data2
@end example

@noindent
In the above example, @command{pgawk} places the profile in
@file{myprog.prof} instead of in @file{awkprof.out}.

Regular @command{gawk} also accepts this option.  When called with just
@option{--profile}, @command{gawk} ``pretty prints'' the program into
@file{awkprof.out}, without any execution counts.  You may supply an
option to @option{--profile} to change the @value{FN}.  Here is a sample
session showing a simple @command{awk} program, its input data, and the
results from running @command{pgawk}.
First, the @command{awk} program:

@example
BEGIN @{ print "First BEGIN rule" @}

END @{ print "First END rule" @}

/foo/ @{
    print "matched /foo/, gosh"
    for (i = 1; i <= 3; i++)
        sing()
@}

@{
    if (/foo/)
        print "if is true"
    else
        print "else is true"
@}

BEGIN @{ print "Second BEGIN rule" @}

END @{ print "Second END rule" @}

function sing(    dummy)
@{
    print "I gotta be me!"
@}
@end example

Following is the input data:

@example
foo
bar
baz
foo
junk
@end example

Here is the @file{awkprof.out} that results from running @command{pgawk}
on this program and data (this example also illustrates that @command{awk}
programmers sometimes have to work late):

@cindex @code{BEGIN} pattern, @command{pgawk} program
@cindex @code{END} pattern, @command{pgawk} program
@example
        # gawk profile, created Sun Aug 13 00:00:15 2000

        # BEGIN block(s)

        BEGIN @{
     1          print "First BEGIN rule"
     1          print "Second BEGIN rule"
        @}

        # Rule(s)

     5  /foo/   @{ # 2
     2          print "matched /foo/, gosh"
     6          for (i = 1; i <= 3; i++) @{
     6                  sing()
                @}
        @}

     5  @{
     5          if (/foo/) @{ # 2
     2                  print "if is true"
     3          @} else @{
     3                  print "else is true"
                @}
        @}

        # END block(s)

        END @{
     1          print "First END rule"
     1          print "Second END rule"
        @}

        # Functions, listed alphabetically

     6  function sing(dummy)
        @{
     6          print "I gotta be me!"
        @}
@end example

This example illustrates many of the basic rules for profiling output.
The rules are as follows:

@itemize @bullet
@item
The program is printed in the order @code{BEGIN} rule,
pattern/action rules, @code{END} rule and functions, listed
alphabetically.
Multiple @code{BEGIN} and @code{END} rules are merged together.

@cindex patterns, counts
@item
Pattern-action rules have two counts.
The first count, to the left of the rule, shows how many times
the rule's pattern was @emph{tested}.
The second count, to the right of the rule's opening left brace
in a comment,
shows how many times the rule's action was @emph{executed}.
The difference between the two indicates how many times the rule's
pattern evaluated to false.

@item
Similarly,
the count for an @code{if}-@code{else} statement shows how many times
the condition was tested.
To the right of the opening left brace for the @code{if}'s body
is a count showing how many times the condition was true.
The count for the @code{else}
indicates how many times the test failed.

@cindex loops, count for header
@item
The count for a loop header (such as @code{for}
or @code{while}) shows how many times the loop test was executed.
(Because of this, you can't just look at the count on the first
statement in a rule to determine how many times the rule was executed.
If the first statement is a loop, the count is misleading.)

@cindex functions, user-defined, counts
@cindex user-defined, functions, counts
@item
For user-defined functions, the count next to the @code{function}
keyword indicates how many times the function was called.
The counts next to the statements in the body show how many times
those statements were executed.

@cindex @code{@{@}} (braces), @command{pgawk} program
@cindex braces (@code{@{@}}), @command{pgawk} program
@item
The layout uses ``K&R'' style with tabs.
Braces are used everywhere, even when
the body of an @code{if}, @code{else}, or loop is only a single statement.

@cindex @code{()} (parentheses), @command{pgawk} program
@cindex parentheses @code{()}, @command{pgawk} program
@item
Parentheses are used only where needed, as indicated by the structure
of the program and the precedence rules.
@c extra verbiage here satisfies the copyeditor. ugh.
For example, @samp{(3 + 5) * 4} means add three plus five, then multiply
the total by four.  However, @samp{3 + 5 * 4} has no parentheses, and
means @samp{3 + (5 * 4)}.

@item
All string concatenations are parenthesized too.
(This could be made a bit smarter.)

@item
Parentheses are used around the arguments to @code{print}
and @code{printf} only when
the @code{print} or @code{printf} statement is followed by a redirection.
Similarly, if
the target of a redirection isn't a scalar, it gets parenthesized.

@item
@command{pgawk} supplies leading comments in
front of the @code{BEGIN} and @code{END} rules,
the pattern/action rules, and the functions.

@end itemize

The profiled version of your program may not look exactly like what you
typed when you wrote it.  This is because @command{pgawk} creates the
profiled version by ``pretty printing'' its internal representation of
the program.  The advantage to this is that @command{pgawk} can produce
a standard representation.  The disadvantage is that all source-code
comments are lost, as are the distinctions among multiple @code{BEGIN}
and @code{END} rules.
Also, things such as:

@example
/foo/
@end example

@noindent
come out as:

@example
/foo/   @{
    print $0
@}
@end example

@noindent
which is correct, but possibly surprising.

@cindex profiling @command{awk} programs, dynamically
@cindex @command{pgawk} program, dynamic profiling
Besides creating profiles when a program has completed,
@command{pgawk} can produce a profile while it is running.
This is useful if your @command{awk} program goes into an
infinite loop and you want to see what has been executed.
To use this feature, run @command{pgawk} in the background:

@example
$ pgawk -f myprog &
[1] 13992
@end example

@c comma does NOT start secondary
@cindex @command{kill} command, dynamic profiling
@cindex @code{USR1} signal
@cindex signals, @code{USR1}/@code{SIGUSR1}
@noindent
The shell prints a job number and process ID number; in this case, 13992.
Use the @command{kill} command to send the @code{USR1} signal
to @command{pgawk}:

@example
$ kill -USR1 13992
@end example

@noindent
As usual, the profiled version of the program is written to
@file{awkprof.out}, or to a different file if you use the @option{--profile}
option.

Along with the regular profile, as shown earlier, the profile
includes a trace of any active functions:

@example
# Function Call Stack:

#   3. baz
#   2. bar
#   1. foo
# -- main --
@end example

You may send @command{pgawk} the @code{USR1} signal as many times as you like.
Each time, the profile and function call trace are appended to the output
profile file.
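
Dynamic profiling can be combined with the @option{--profile} option.
The following session is only a sketch: the @value{FN}s are
placeholders, the shell's job-number message is omitted, and @samp{$!}
is the shell's notation for the process ID of the most recent background
job, which saves copying the number by hand:

@example
$ pgawk --profile=myprog.prof -f myprog.awk data1 &
$ kill -USR1 $!      # first snapshot, written to myprog.prof
$ sleep 10
$ kill -USR1 $!      # second snapshot, appended to myprog.prof
@end example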

@cindex @code{HUP} signal
@cindex signals, @code{HUP}/@code{SIGHUP}
If you use the @code{HUP} signal instead of the @code{USR1} signal,
@command{pgawk} produces the profile and the function call trace and then exits.

@cindex @code{INT} signal (MS-DOS)
@cindex signals, @code{INT}/@code{SIGINT} (MS-DOS)
@cindex @code{QUIT} signal (MS-DOS)
@cindex signals, @code{QUIT}/@code{SIGQUIT} (MS-DOS)
When @command{pgawk} runs on MS-DOS or MS-Windows, it uses the
@code{INT} and @code{QUIT} signals for producing the profile and, in
the case of the @code{INT} signal, @command{pgawk} exits.  This is
because these systems don't support the @command{kill} command, so the
only signals you can deliver to a program are those generated by the
keyboard.  The @code{INT} signal is generated by the
@kbd{@value{CTL}-@key{C}} or @kbd{@value{CTL}-@key{BREAK}} key, while the
@code{QUIT} signal is generated by the @kbd{@value{CTL}-@key{\}} key.
@c ENDOFRANGE advgaw
@c ENDOFRANGE gawadv
@c ENDOFRANGE pgawk
@c ENDOFRANGE awkp
@c ENDOFRANGE proawk

@node Invoking Gawk
@chapter Running @command{awk} and @command{gawk}

This @value{CHAPTER} covers how to run @command{awk}, both POSIX-standard
and @command{gawk}-specific command-line options, and what
@command{awk} and
@command{gawk} do with non-option arguments.
It then proceeds to cover how @command{gawk} searches for source files,
obsolete options and/or features, and known bugs in @command{gawk}.
This @value{CHAPTER} rounds out the discussion of @command{awk}
as a program and as a language.

While a number of the options and features described here were
discussed in passing earlier in the book, this @value{CHAPTER} provides the
full details.

@menu
* Command Line::                How to run @command{awk}.
* Options::                     Command-line options and their meanings.
* Other Arguments::             Input file names and variable assignments.
* AWKPATH Variable::            Searching directories for @command{awk}
                                programs.
* Obsolete::                    Obsolete Options and/or features.
* Undocumented::                Undocumented Options and Features.
* Known Bugs::                  Known Bugs in @command{gawk}.
@end menu

@node Command Line
@section Invoking @command{awk}
@cindex command line, invoking @command{awk} from
@cindex @command{awk}, invoking
@cindex arguments, command-line, invoking @command{awk}
@cindex options, command-line, invoking @command{awk}

There are two ways to run @command{awk}---with an explicit program or with
one or more program files.  Here are templates for both of them; items
enclosed in [@dots{}] in these templates are optional:

@example
awk @r{[@var{options}]} -f progfile @r{[@code{--}]} @var{file} @dots{}
awk @r{[@var{options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
@end example

@cindex GNU long options
@cindex long options
@cindex options, long
Besides traditional one-letter POSIX-style options, @command{gawk} also
supports GNU long options.

@cindex dark corner, invoking @command{awk}
@cindex lint checking, empty programs
It is possible to invoke @command{awk} with an empty program:

@example
awk '' datafile1 datafile2
@end example

@cindex @code{--lint} option
@noindent
Doing so makes little sense, though; @command{awk} exits
silently when given an empty program.
@value{DARKCORNER}
If @option{--lint} has
been specified on the command line, @command{gawk} issues a
warning that the program is empty.
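
As a small illustration of the two invocation templates, the following
session runs the same one-line program both ways.  The @value{FN}
@file{second.awk} is made up for this example:

@example
$ echo one two | awk '@{ print $2 @}'
@print{} two
$ echo '@{ print $2 @}' > second.awk
$ echo one two | awk -f second.awk
@print{} two
@end example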

@node Options
@section Command-Line Options
@c STARTOFRANGE ocl
@cindex options, command-line
@c STARTOFRANGE clo
@cindex command line, options
@c STARTOFRANGE gnulo
@cindex GNU long options
@c STARTOFRANGE longo
@cindex options, long

Options begin with a dash and consist of a single character.
GNU-style long options consist of two dashes and a keyword.
The keyword can be abbreviated, as long as the abbreviation allows the option
to be uniquely identified.  If the option takes an argument, then the
keyword is either immediately followed by an equals sign (@samp{=}) and the
argument's value, or the keyword and the argument's value are separated
by whitespace.
If a particular option with a value is given more than once, it is the
last value that counts.

@cindex POSIX @command{awk}, GNU long options and
Each long option for @command{gawk} has a corresponding
POSIX-style option.
The long and short options are
interchangeable in all contexts.
The options and their meanings are as follows:

@table @code
@item -F @var{fs}
@itemx --field-separator @var{fs}
@cindex @code{-F} option
@cindex @code{--field-separator} option
@cindex @code{FS} variable, @code{--field-separator} option and
Sets the @code{FS} variable to @var{fs}
(@pxref{Field Separators}).

@item -f @var{source-file}
@itemx --file @var{source-file}
@cindex @code{-f} option
@cindex @code{--file} option
@cindex @command{awk} programs, location of
Indicates that the @command{awk} program is to be found in @var{source-file}
instead of in the first non-option argument.

@item -v @var{var}=@var{val}
@itemx --assign @var{var}=@var{val}
@cindex @code{-v} option
@cindex @code{--assign} option
@cindex variables, setting
Sets the variable @var{var} to the value @var{val} @emph{before}
execution of the program begins.  Such variable values are available
inside the @code{BEGIN} rule
(@pxref{Other Arguments}).

The @option{-v} option can only set one variable, but it can be used
more than once, setting another variable each time, like this:
@samp{awk @w{-v foo=1} @w{-v bar=2} @dots{}}.

@c last comma is part of secondary
@cindex built-in variables, @code{-v} option, setting with
@c last comma is part of tertiary
@cindex variables, built-in, @code{-v} option, setting with
@strong{Caution:} Using @option{-v} to set the values of the built-in
variables may lead to surprising results.  @command{awk} will reset the
values of those variables as it needs to, possibly ignoring any
predefined value you may have given.

@item -mf @var{N}
@itemx -mr @var{N}
@cindex @code{-mf}/@code{-mr} options
@cindex memory, setting limits
Sets various memory limits to the value @var{N}.  The @samp{f} flag sets
the maximum number of fields and the @samp{r} flag sets the maximum
record size.  These two flags and the @option{-m} option are from the
Bell Laboratories research version of Unix @command{awk}.  They are provided
for compatibility but otherwise ignored by
@command{gawk}, since @command{gawk} has no predefined limits.
(The Bell Laboratories @command{awk} no longer needs these options;
it continues to accept them to avoid breaking old programs.)

@item -W @var{gawk-opt}
@cindex @code{-W} option
Following the POSIX standard, implementation-specific
options are supplied as arguments to the @option{-W} option.
These options
also have corresponding GNU-style long options.
Note that the long options may be abbreviated, as long as
the abbreviations remain unique.
The full list of @command{gawk}-specific options is provided next.

@item --
@cindex command line, options, end of
@cindex options, command-line, end of
Signals the end of the command-line options.  The following arguments
are not treated as options even if they begin with @samp{-}.  This
interpretation of @option{--} follows the POSIX argument parsing
conventions.

@cindex @code{-} (hyphen), filenames beginning with
@cindex hyphen (@code{-}), filenames beginning with
This is useful if you have @value{FN}s that start with @samp{-},
or in shell scripts, if you have @value{FN}s that will be specified
by the user that could start with @samp{-}.
@end table
@c ENDOFRANGE gnulo
@c ENDOFRANGE longo

The previous list described options mandated by the POSIX standard,
as well as options available in the Bell Laboratories version of @command{awk}.
The following list describes @command{gawk}-specific options:

@table @code
@item -W compat
@itemx -W traditional
@itemx --compat
@itemx --traditional
@cindex @code{--compat} option
@cindex @code{--traditional} option
@cindex compatibility mode (@command{gawk}), specifying
Specifies @dfn{compatibility mode}, in which the GNU extensions to
the @command{awk} language are disabled, so that @command{gawk} behaves just
like the Bell Laboratories research version of Unix @command{awk}.
@option{--traditional} is the preferred form of this option.
@xref{POSIX/GNU},
which summarizes the extensions.  Also see
@ref{Compatibility Mode}.

@item -W copyright
@itemx --copyright
@cindex @code{--copyright} option
@cindex GPL (General Public License), printing
Prints the short version of the General Public License and then exits.

@item -W copyleft
@itemx --copyleft
@cindex @code{--copyleft} option
Just like @option{--copyright}.
This option may disappear in a future version of @command{gawk}.

@cindex @code{--dump-variables} option
@cindex @code{awkvars.out} file
@cindex files, @code{awkvars.out}
@cindex variables, global, printing list of
@item -W dump-variables@r{[}=@var{file}@r{]}
@itemx --dump-variables@r{[}=@var{file}@r{]}
Prints a sorted list of global variables, their types, and final values
to @var{file}.  If no @var{file} is provided, @command{gawk} prints this
list to the file named @file{awkvars.out} in the current directory.

@c last comma is part of secondary
@cindex troubleshooting, typographical errors, global variables
Having a list of all global variables is a good way to look for
typographical errors in your programs.
You would also use this option if you have a large program with a lot of
functions, and you want to be sure that your functions don't
inadvertently use global variables that you meant to be local.
(This is a particularly easy mistake to make with simple variable
names like @code{i}, @code{j}, etc.)

@item -W gen-po
@itemx --gen-po
@cindex @code{--gen-po} option
@cindex portable object files, generating
@cindex files, portable object, generating
Analyzes the source program and
generates a GNU @code{gettext} Portable Object file on standard
output for all string constants that have been marked for translation.
@xref{Internationalization},
for information about this option.

@item -W help
@itemx -W usage
@itemx --help
@itemx --usage
@cindex @code{--help} option
@cindex @code{--usage} option
@cindex GNU long options, printing list of
@cindex options, printing list of
@cindex printing, list of options
Prints a ``usage'' message summarizing the short and long style options
that @command{gawk} accepts and then exits.

@item -W lint@r{[}=fatal@r{]}
@itemx --lint@r{[}=fatal@r{]}
@cindex @code{--lint} option
@cindex lint checking, issuing warnings
@cindex warnings, issuing
Warns about constructs that are dubious or nonportable to
other @command{awk} implementations.
Some warnings are issued when @command{gawk} first reads your program.  Others
are issued at runtime, as your program executes.
With an optional argument of @samp{fatal},
lint warnings become fatal errors.
This may be drastic, but its use will certainly encourage the
development of cleaner @command{awk} programs.
With an optional argument of @samp{invalid}, only warnings about things that are
actually invalid are issued.  (This is not fully implemented yet.)

@item -W lint-old
@itemx --lint-old
@cindex @code{--lint-old} option
Warns about constructs that are not available in the original version of
@command{awk} from Version 7 Unix
(@pxref{V7/SVR3.1}).

@item -W non-decimal-data
@itemx --non-decimal-data
@cindex @code{--non-decimal-data} option
@cindex hexadecimal, values, enabling interpretation of
@c comma is part of primary
@cindex octal values, enabling interpretation of
Enables automatic interpretation of octal and hexadecimal
values in input data
(@pxref{Nondecimal Data}).

@cindex troubleshooting, @code{--non-decimal-data} option
@strong{Caution:} This option can severely break old programs.
Use with care.
16333 16334@item -W posix 16335@itemx --posix 16336@cindex @code{--posix} option 16337@cindex POSIX mode 16338@c last comma is part of tertiary 16339@cindex @command{gawk}, extensions, disabling 16340Operates in strict POSIX mode. This disables all @command{gawk} 16341extensions (just like @option{--traditional}) and adds the following additional 16342restrictions: 16343 16344@c IMPORTANT! Keep this list in sync with the one in node POSIX 16345 16346@itemize @bullet 16347@cindex escape sequences, unrecognized 16348@item 16349@code{\x} escape sequences are not recognized 16350(@pxref{Escape Sequences}). 16351 16352@cindex newlines 16353@cindex whitespace, newlines as 16354@item 16355Newlines do not act as whitespace to separate fields when @code{FS} is 16356equal to a single space 16357(@pxref{Fields}). 16358 16359@item 16360Newlines are not allowed after @samp{?} or @samp{:} 16361(@pxref{Conditional Exp}). 16362 16363@item 16364The synonym @code{func} for the keyword @code{function} is not 16365recognized (@pxref{Definition Syntax}). 16366 16367@cindex @code{*} (asterisk), @code{**} operator 16368@cindex asterisk (@code{*}), @code{**} operator 16369@cindex @code{*} (asterisk), @code{**=} operator 16370@cindex asterisk (@code{*}), @code{**=} operator 16371@cindex @code{^} (caret), @code{^} operator 16372@cindex caret (@code{^}), @code{^} operator 16373@cindex @code{^} (caret), @code{^=} operator 16374@cindex caret (@code{^}), @code{^=} operator 16375@item 16376The @samp{**} and @samp{**=} operators cannot be used in 16377place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops}, 16378and also @pxref{Assignment Ops}). 16379 16380@cindex @code{FS} variable, as TAB character 16381@item 16382Specifying @samp{-Ft} on the command-line does not set the value 16383of @code{FS} to be a single TAB character 16384(@pxref{Field Separators}). 
16385 16386@c comma does not start secondary 16387@cindex @code{fflush} function, unsupported 16388@item 16389The @code{fflush} built-in function is not supported 16390(@pxref{I/O Functions}). 16391@end itemize 16392 16393@c @cindex automatic warnings 16394@c @cindex warnings, automatic 16395@cindex @code{--traditional} option, @code{--posix} option and 16396@cindex @code{--posix} option, @code{--traditional} option and 16397If you supply both @option{--traditional} and @option{--posix} on the 16398command line, @option{--posix} takes precedence. @command{gawk} 16399also issues a warning if both options are supplied. 16400 16401@item -W profile@r{[}=@var{file}@r{]} 16402@itemx --profile@r{[}=@var{file}@r{]} 16403@cindex @code{--profile} option 16404@cindex @command{awk} programs, profiling, enabling 16405Enable profiling of @command{awk} programs 16406(@pxref{Profiling}). 16407By default, profiles are created in a file named @file{awkprof.out}. 16408The optional @var{file} argument allows you to specify a different 16409@value{FN} for the profile file. 16410 16411When run with @command{gawk}, the profile is just a ``pretty printed'' version 16412of the program. When run with @command{pgawk}, the profile contains execution 16413counts for each statement in the program in the left margin, and function 16414call counts for each function. 16415 16416@item -W re-interval 16417@itemx --re-interval 16418@cindex @code{--re-interval} option 16419@cindex regular expressions, interval expressions and 16420Allows interval expressions 16421(@pxref{Regexp Operators}) 16422in regexps. 16423Because interval expressions were traditionally not available in @command{awk}, 16424@command{gawk} does not provide them by default. This prevents old @command{awk} 16425programs from breaking. 
16426 16427@item -W source @var{program-text} 16428@itemx --source @var{program-text} 16429@cindex @code{--source} option 16430@cindex source code, mixing 16431Allows you to mix source code in files with source 16432code that you enter on the command line. 16433Program source code is taken from the @var{program-text}. 16434This is particularly useful 16435when you have library functions that you want to use from your command-line 16436programs (@pxref{AWKPATH Variable}). 16437 16438@item -W version 16439@itemx --version 16440@cindex @code{--version} option 16441@c last comma is part of tertiary 16442@cindex @command{gawk}, versions of, information about, printing 16443Prints version information for this particular copy of @command{gawk}. 16444This allows you to determine if your copy of @command{gawk} is up to date 16445with respect to whatever the Free Software Foundation is currently 16446distributing. 16447It is also useful for bug reports 16448(@pxref{Bugs}). 16449@end table 16450 16451As long as program text has been supplied, 16452any other options are flagged as invalid with a warning message but 16453are otherwise ignored. 16454 16455@cindex @code{-F} option, @code{-Ft} sets @code{FS} to TAB 16456In compatibility mode, as a special case, if the value of @var{fs} supplied 16457to the @option{-F} option is @samp{t}, then @code{FS} is set to the TAB 16458character (@code{"\t"}). This is true only for @option{--traditional} and not 16459for @option{--posix} 16460(@pxref{Field Separators}). 16461 16462@cindex @code{-f} option, on command line 16463The @option{-f} option may be used more than once on the command line. 16464If it is, @command{awk} reads its program source from all of the named files, as 16465if they had been concatenated together into one big file. This is 16466useful for creating libraries of @command{awk} functions. 
These functions 16467can be written once and then retrieved from a standard place, instead 16468of having to be included into each individual program. 16469(As mentioned in 16470@ref{Definition Syntax}, 16471function names must be unique.) 16472 16473Library functions can still be used, even if the program is entered at the terminal, 16474by specifying @samp{-f /dev/tty}. After typing your program, 16475type @kbd{@value{CTL}-d} (the end-of-file character) to terminate it. 16476(You may also use @samp{-f -} to read program source from the standard 16477input but then you will not be able to also use the standard input as a 16478source of data.) 16479 16480Because it is clumsy using the standard @command{awk} mechanisms to mix source 16481file and command-line @command{awk} programs, @command{gawk} provides the 16482@option{--source} option. This does not require you to pre-empt the standard 16483input for your source code; it allows you to easily mix command-line 16484and library source code 16485(@pxref{AWKPATH Variable}). 16486 16487@cindex @code{--source} option 16488If no @option{-f} or @option{--source} option is specified, then @command{gawk} 16489uses the first non-option command-line argument as the text of the 16490program source code. 16491 16492@cindex @code{POSIXLY_CORRECT} environment variable 16493@cindex lint checking, @code{POSIXLY_CORRECT} environment variable 16494@cindex POSIX mode 16495If the environment variable @env{POSIXLY_CORRECT} exists, 16496then @command{gawk} behaves in strict POSIX mode, exactly as if 16497you had supplied the @option{--posix} command-line option. 16498Many GNU programs look for this environment variable to turn on 16499strict POSIX mode. If @option{--lint} is supplied on the command line 16500and @command{gawk} turns on POSIX mode because of @env{POSIXLY_CORRECT}, 16501then it issues a warning message indicating that POSIX 16502mode is in effect. 16503You would typically set this variable in your shell's startup file. 
16504For a Bourne-compatible shell (such as @command{bash}), you would add these 16505lines to the @file{.profile} file in your home directory: 16506 16507@example 16508POSIXLY_CORRECT=true 16509export POSIXLY_CORRECT 16510@end example 16511 16512@cindex @command{csh} utility, @code{POSIXLY_CORRECT} environment variable 16513For a @command{csh}-compatible 16514shell,@footnote{Not recommended.} 16515you would add this line to the @file{.login} file in your home directory: 16516 16517@example 16518setenv POSIXLY_CORRECT true 16519@end example 16520 16521@cindex portability, @code{POSIXLY_CORRECT} environment variable 16522Having @env{POSIXLY_CORRECT} set is not recommended for daily use, 16523but it is good for testing the portability of your programs to other 16524environments. 16525@c ENDOFRANGE ocl 16526@c ENDOFRANGE clo 16527 16528@node Other Arguments 16529@section Other Command-Line Arguments 16530@cindex command line, arguments 16531@cindex arguments, command-line 16532 16533Any additional arguments on the command line are normally treated as 16534input files to be processed in the order specified. However, an 16535argument that has the form @code{@var{var}=@var{value}}, assigns 16536the value @var{value} to the variable @var{var}---it does not specify a 16537file at all. 16538(This was discussed earlier in 16539@ref{Assignment Options}.) 16540 16541@cindex @code{ARGIND} variable, command-line arguments 16542@cindex @code{ARGC}/@code{ARGV} variables, command-line arguments 16543All these arguments are made available to your @command{awk} program in the 16544@code{ARGV} array (@pxref{Built-in Variables}). Command-line options 16545and the program text (if present) are omitted from @code{ARGV}. 16546All other arguments, including variable assignments, are 16547included. As each element of @code{ARGV} is processed, @command{gawk} 16548sets the variable @code{ARGIND} to the index in @code{ARGV} of the 16549current element. 
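
As a brief illustration of the preceding description, the following
one-line program prints every element of @code{ARGV}.  It is only a
sketch: the @value{FN} @file{mydata} and the variable name @code{pass}
are placeholders, and because the program consists solely of a
@code{BEGIN} rule, @command{awk} never tries to open @file{mydata}:

@example
awk 'BEGIN @{
    # ARGV[0] is the name awk was invoked under; options and the
    # program text itself do not appear in the array
    for (i = 0; i < ARGC; i++)
        printf "ARGV[%d] = %s\n", i, ARGV[i]
@}' pass=1 mydata
@end example

@noindent
Here @code{ARGV[1]} is @samp{pass=1} and @code{ARGV[2]} is @samp{mydata},
showing that a variable assignment occupies a slot in @code{ARGV} just as
a @value{FN} does.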
16550 16551@cindex input files, variable assignments and 16552The distinction between @value{FN} arguments and variable-assignment 16553arguments is made when @command{awk} is about to open the next input file. 16554At that point in execution, it checks the @value{FN} to see whether 16555it is really a variable assignment; if so, @command{awk} sets the variable 16556instead of reading a file. 16557 16558Therefore, the variables actually receive the given values after all 16559previously specified files have been read. In particular, the values of 16560variables assigned in this fashion are @emph{not} available inside a 16561@code{BEGIN} rule 16562(@pxref{BEGIN/END}), 16563because such rules are run before @command{awk} begins scanning the argument list. 16564 16565@cindex dark corner, escape sequences 16566The variable values given on the command line are processed for escape 16567sequences (@pxref{Escape Sequences}). 16568@value{DARKCORNER} 16569 16570In some earlier implementations of @command{awk}, when a variable assignment 16571occurred before any @value{FN}s, the assignment would happen @emph{before} 16572the @code{BEGIN} rule was executed. @command{awk}'s behavior was thus 16573inconsistent; some command-line assignments were available inside the 16574@code{BEGIN} rule, while others were not. Unfortunately, 16575some applications came to depend 16576upon this ``feature.'' When @command{awk} was changed to be more consistent, 16577the @option{-v} option was added to accommodate applications that depended 16578upon the old behavior. 16579 16580The variable assignment feature is most useful for assigning to variables 16581such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and 16582output formats before scanning the @value{DF}s. It is also useful for 16583controlling state if multiple passes are needed over a @value{DF}. 
For 16584example: 16585 16586@cindex files, multiple passes over 16587@example 16588awk 'pass == 1 @{ @var{pass 1 stuff} @} 16589 pass == 2 @{ @var{pass 2 stuff} @}' pass=1 mydata pass=2 mydata 16590@end example 16591 16592Given the variable assignment feature, the @option{-F} option for setting 16593the value of @code{FS} is not 16594strictly necessary. It remains for historical compatibility. 16595 16596@node AWKPATH Variable 16597@section The @env{AWKPATH} Environment Variable 16598@cindex @env{AWKPATH} environment variable 16599@cindex directories, searching 16600@cindex search paths, for source files 16601@cindex differences in @command{awk} and @command{gawk}, @code{AWKPATH} environment variable 16602@ifinfo 16603The previous @value{SECTION} described how @command{awk} program files can be named 16604on the command-line with the @option{-f} option. 16605@end ifinfo 16606In most @command{awk} 16607implementations, you must supply a precise path name for each program 16608file, unless the file is in the current directory. 16609But in @command{gawk}, if the @value{FN} supplied to the @option{-f} option 16610does not contain a @samp{/}, then @command{gawk} searches a list of 16611directories (called the @dfn{search path}), one by one, looking for a 16612file with the specified name. 16613 16614The search path is a string consisting of directory names 16615separated by colons. @command{gawk} gets its search path from the 16616@env{AWKPATH} environment variable. If that variable does not exist, 16617@command{gawk} uses a default path, 16618@samp{.:/usr/local/share/awk}.@footnote{Your version of @command{gawk} 16619may use a different directory; it 16620will depend upon how @command{gawk} was built and installed. The actual 16621directory is the value of @samp{$(datadir)} generated when 16622@command{gawk} was configured. 
You probably don't need to worry about this, 16623though.} (Programs written for use by 16624system administrators should use an @env{AWKPATH} variable that 16625does not include the current directory, @file{.}.) 16626 16627The search path feature is particularly useful for building libraries 16628of useful @command{awk} functions. The library files can be placed in a 16629standard directory in the default path and then specified on 16630the command line with a short @value{FN}. Otherwise, the full @value{FN} 16631would have to be typed for each file. 16632 16633By using both the @option{--source} and @option{-f} options, your command-line 16634@command{awk} programs can use facilities in @command{awk} library files 16635(@pxref{Library Functions}). 16636Path searching is not done if @command{gawk} is in compatibility mode. 16637This is true for both @option{--traditional} and @option{--posix}. 16638@xref{Options}. 16639 16640@strong{Note:} If you want files in the current directory to be found, 16641you must include the current directory in the path, either by including 16642@file{.} explicitly in the path or by writing a null entry in the 16643path. (A null entry is indicated by starting or ending the path with a 16644colon or by placing two colons next to each other (@samp{::}).) If the 16645current directory is not included in the path, then files cannot be 16646found in the current directory. This path search mechanism is identical 16647to the shell's. 16648@c someday, @cite{The Bourne Again Shell}.... 16649 16650Starting with @value{PVERSION} 3.0, if @env{AWKPATH} is not defined in the 16651environment, @command{gawk} places its default search path into 16652@code{ENVIRON["AWKPATH"]}. This makes it easy to determine 16653the actual search path that @command{gawk} will use 16654from within an @command{awk} program. 
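
For instance, the following commands sketch how the search path might
be inspected and overridden from a Bourne-compatible shell.  The
directory @file{$HOME/awklib} and the files @file{mylib.awk},
@file{prog.awk}, and @file{mydata} are hypothetical names chosen only
for illustration:

@example
# show the search path gawk will use (the default, if AWKPATH is unset)
gawk 'BEGIN @{ print ENVIRON["AWKPATH"] @}'

# search a private library directory first, then the current directory
AWKPATH=$HOME/awklib:. gawk -f mylib.awk -f prog.awk mydata
@end example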
16655 16656While you can change @code{ENVIRON["AWKPATH"]} within your @command{awk} 16657program, this has no effect on the running program's behavior. This makes 16658sense: the @env{AWKPATH} environment variable is used to find the program 16659source files. Once your program is running, all the files have been 16660found, and @command{gawk} no longer needs to use @env{AWKPATH}. 16661 16662@node Obsolete 16663@section Obsolete Options and/or Features 16664 16665@cindex features, advanced, See advanced features 16666@cindex options, deprecated 16667@cindex features, deprecated 16668@cindex obsolete features 16669This @value{SECTION} describes features and/or command-line options from 16670previous releases of @command{gawk} that are either not available in the 16671current version or that are still supported but deprecated (meaning that 16672they will @emph{not} be in the next release). 16673 16674@c update this section for each release! 16675 16676@cindex @code{next file} statement, deprecated 16677@cindex @code{nextfile} statement, @code{next file} statement and 16678For @value{PVERSION} @value{VERSION} of @command{gawk}, there are no 16679deprecated command-line options 16680@c or other deprecated features 16681from the previous version of @command{gawk}. 16682The use of @samp{next file} (two words) for @code{nextfile} was deprecated 16683in @command{gawk} 3.0 but still worked. Starting with @value{PVERSION} 3.1, the 16684two-word usage is no longer accepted. 16685 16686The process-related special files described in 16687@ref{Special Process}, 16688work as described, but 16689are now considered deprecated. 16690@command{gawk} prints a warning message every time they are used. 16691(Use @code{PROCINFO} instead; see 16692@ref{Auto-set}.) 16693They will be removed from the next release of @command{gawk}. 
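
As a rough sketch of the recommended replacement, a program that
formerly read its process ID from one of the special files can
consult the @code{PROCINFO} array instead.  The variable name
@code{pid} is chosen here only for illustration:

@example
# deprecated: read the process ID from the special file
#     getline pid < "/dev/pid"

# preferred: look up the same value in PROCINFO
BEGIN @{ pid = PROCINFO["pid"]; print "process ID:", pid @}
@end example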
16694 16695@ignore 16696This @value{SECTION} 16697is thus essentially a place holder, 16698in case some option becomes obsolete in a future version of @command{gawk}. 16699@end ignore 16700 16701@node Undocumented 16702@section Undocumented Options and Features 16703@cindex undocumented features 16704@cindex features, undocumented 16705@cindex Skywalker, Luke 16706@cindex Kenobi, Obi-Wan 16707@cindex Jedi knights 16708@cindex Knights, jedi 16709@quotation 16710@i{Use the Source, Luke!}@* 16711Obi-Wan 16712@end quotation 16713 16714This @value{SECTION} intentionally left 16715blank. 16716 16717@ignore 16718@c If these came out in the Info file or TeX document, then they wouldn't 16719@c be undocumented, would they? 16720 16721@command{gawk} has one undocumented option: 16722 16723@table @code 16724@item -W nostalgia 16725@itemx --nostalgia 16726Print the message @code{"awk: bailing out near line 1"} and dump core. 16727This option was inspired by the common behavior of very early versions of 16728Unix @command{awk} and by a t--shirt. 16729The message is @emph{not} subject to translation in non-English locales. 16730@c so there! nyah, nyah. 16731@end table 16732 16733Early versions of @command{awk} used to not require any separator (either 16734a newline or @samp{;}) between the rules in @command{awk} programs. Thus, 16735it was common to see one-line programs like: 16736 16737@example 16738awk '@{ sum += $1 @} END @{ print sum @}' 16739@end example 16740 16741@command{gawk} actually supports this but it is purposely undocumented 16742because it is considered bad style. The correct way to write such a program 16743is either 16744 16745@example 16746awk '@{ sum += $1 @} ; END @{ print sum @}' 16747@end example 16748 16749@noindent 16750or 16751 16752@example 16753awk '@{ sum += $1 @} 16754 END @{ print sum @}' data 16755@end example 16756 16757@noindent 16758@xref{Statements/Lines}, for a fuller 16759explanation. 
16760 16761You can insert newlines after the @samp{;} in @code{for} loops. 16762This seems to have been a long-undocumented feature in Unix @command{awk}. 16763 16764Similarly, you may use @code{print} or @code{printf} statements in the 16765@var{init} and @var{increment} parts of a @code{for} loop. This is another 16766long-undocumented ``feature'' of Unix @code{awk}. 16767 16768If the environment variable @env{WHINY_USERS} exists 16769when @command{gawk} is run, 16770then the associative @code{for} loop will go through the array 16771indices in sorted order. 16772The comparison used for sorting is simple string comparison; 16773any non-English or non-ASCII locales are not taken into account. 16774@code{IGNORECASE} does not affect the comparison either. 16775 16776In addition, if @env{WHINY_USERS} is set, the profiled version of a 16777program generated by @option{--profile} will print all 8-bit characters 16778verbatim, instead of using the octal equivalent. 16779 16780@end ignore 16781 16782@node Known Bugs 16783@section Known Bugs in @command{gawk} 16784@cindex @command{gawk}, debugging 16785@cindex debugging @command{gawk} 16786@cindex troubleshooting, @command{gawk} 16787 16788@itemize @bullet 16789@cindex troubleshooting, @code{-F} option 16790@cindex @code{-F} option, troubleshooting 16791@cindex @code{FS} variable, changing value of 16792@item 16793The @option{-F} option for changing the value of @code{FS} 16794(@pxref{Options}) 16795is not necessary given the command-line variable 16796assignment feature; it remains only for backward compatibility. 16797 16798@item 16799Syntactically invalid single-character programs tend to overflow 16800the parse stack, generating a rather unhelpful message. Such programs 16801are surprisingly difficult to diagnose in the completely general case, 16802and the effort to do so really is not worth it. 
16803@end itemize 16804 16805@ignore 16806@c Try this 16807@iftex 16808@page 16809@headings off 16810@majorheading II@ @ @ Using @command{awk} and @command{gawk} 16811Part II shows how to use @command{awk} and @command{gawk} for problem solving. 16812There is lots of code here for you to read and learn from. 16813It contains the following chapters: 16814 16815@itemize @bullet 16816@item 16817@ref{Library Functions}. 16818 16819@item 16820@ref{Sample Programs}. 16821 16822@end itemize 16823 16824@page 16825@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @| 16826@oddheading @| @| @strong{@thischapter}@ @ @ @thispage 16827@end iftex 16828@end ignore 16829 16830@node Library Functions 16831@chapter A Library of @command{awk} Functions 16832@c STARTOFRANGE libf 16833@cindex libraries of @command{awk} functions 16834@c STARTOFRANGE flib 16835@cindex functions, library 16836@c STARTOFRANGE fudlib 16837@cindex functions, user-defined, library of 16838 16839@ref{User-defined}, describes how to write 16840your own @command{awk} functions. Writing functions is important, because 16841it allows you to encapsulate algorithms and program tasks in a single 16842place. It simplifies programming, making program development more 16843manageable, and making programs more readable. 16844 16845One valuable way to learn a new programming language is to @emph{read} 16846programs in that language. To that end, this @value{CHAPTER} 16847and @ref{Sample Programs}, 16848provide a good-sized body of code for you to read, 16849and hopefully, to learn from. 16850 16851@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!! 16852This @value{CHAPTER} presents a library of useful @command{awk} functions. 16853Many of the sample programs presented later in this @value{DOCUMENT} 16854use these functions. 16855The functions are presented here in a progression from simple to complex. 

@cindex Texinfo
@ref{Extract Program},
presents a program that you can use to extract the source code for
these example library functions and programs from the Texinfo source
for this @value{DOCUMENT}.
(This has already been done as part of the @command{gawk} distribution.)

If you have written one or more useful, general-purpose @command{awk} functions
and would like to contribute them to the author's collection of @command{awk}
programs, see
@ref{How To Contribute}, for more information.

@cindex portability, example programs
The programs in this @value{CHAPTER} and in
@ref{Sample Programs},
freely use features that are @command{gawk}-specific.
Rewriting these programs for different implementations of @command{awk} is pretty straightforward.

Diagnostic error messages are sent to @file{/dev/stderr}.
Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"} if your system
does not have a @file{/dev/stderr}, or if you cannot use @command{gawk}.

A number of programs use @code{nextfile}
(@pxref{Nextfile Statement})
to skip any remaining input in the input file.
@ref{Nextfile Function},
shows you how to write a function that does the same thing.

@c 12/2000: Thanks to Nelson Beebe for pointing out the output issue.
@cindex case sensitivity, example programs
@cindex @code{IGNORECASE} variable, in example programs
Finally, some of the programs choose to ignore upper- and lowercase
distinctions in their input. They do so by assigning one to @code{IGNORECASE}.
You can achieve almost the same effect@footnote{The effects are
not identical.
Output of the transformed 16892record will be in all lowercase, while @code{IGNORECASE} preserves the original 16893contents of the input record.} by adding the following rule to the 16894beginning of the program: 16895 16896@example 16897# ignore case 16898@{ $0 = tolower($0) @} 16899@end example 16900 16901@noindent 16902Also, verify that all regexp and string constants used in 16903comparisons use only lowercase letters. 16904 16905@menu 16906* Library Names:: How to best name private global variables in 16907 library functions. 16908* General Functions:: Functions that are of general use. 16909* Data File Management:: Functions for managing command-line data 16910 files. 16911* Getopt Function:: A function for processing command-line 16912 arguments. 16913* Passwd Functions:: Functions for getting user information. 16914* Group Functions:: Functions for getting group information. 16915@end menu 16916 16917@node Library Names 16918@section Naming Library Function Global Variables 16919 16920@cindex names, arrays/variables 16921@cindex names, functions 16922@cindex namespace issues 16923@cindex @command{awk} programs, documenting 16924@cindex documentation, of @command{awk} programs 16925Due to the way the @command{awk} language evolved, variables are either 16926@dfn{global} (usable by the entire program) or @dfn{local} (usable just by 16927a specific function). There is no intermediate state analogous to 16928@code{static} variables in C. 16929 16930@cindex variables, global, for library functions 16931@cindex private variables 16932@cindex variables, private 16933Library functions often need to have global variables that they can use to 16934preserve state information between calls to the function---for example, 16935@code{getopt}'s variable @code{_opti} 16936(@pxref{Getopt Function}). 16937Such variables are called @dfn{private}, since the only functions that need to 16938use them are the ones in the library. 

When writing a library function, you should try to choose names for your
private variables that will not conflict with any variables used by
either another library function or a user's main program.  For example, a
name like @samp{i} or @samp{j} is not a good choice, because user programs
often use variable names like these for their own purposes.

@cindex programming conventions, private variable names
The example programs shown in this @value{CHAPTER} all start the names of their
private variables with an underscore (@samp{_}).  Users generally don't use
leading underscores in their variable names, so this convention immediately
decreases the chances that the variable name will be accidentally shared
with the user's program.

@cindex @code{_} (underscore), in names of private variables
@cindex underscore (@code{_}), in names of private variables
In addition, several of the library functions use a prefix that helps
indicate what function or set of functions use the variables---for example,
@code{_pw_byname} in the user database routines
(@pxref{Passwd Functions}).
This convention is recommended, since it further decreases the
chance of inadvertent conflict among variable names. Note that this
convention works equally well for variable names and for private
function names.@footnote{While all the library routines could have
been rewritten to use this convention, this was not done, in order to
show how my own @command{awk} programming style has evolved and to
provide some basis for this discussion.}

As a final note on variable naming, if a function makes global variables
available for use by a main program, it is a good convention to start those
variables' names with a capital letter---for
example, @code{getopt}'s @code{Opterr} and @code{Optind} variables
(@pxref{Getopt Function}).
16972The leading capital letter indicates that it is global, while the fact that 16973the variable name is not all capital letters indicates that the variable is 16974not one of @command{awk}'s built-in variables, such as @code{FS}. 16975 16976@cindex @code{--dump-variables} option 16977It is also important that @emph{all} variables in library 16978functions that do not need to save state are, in fact, declared 16979local.@footnote{@command{gawk}'s @option{--dump-variables} command-line 16980option is useful for verifying this.} If this is not done, the variable 16981could accidentally be used in the user's program, leading to bugs that 16982are very difficult to track down: 16983 16984@example 16985function lib_func(x, y, l1, l2) 16986@{ 16987 @dots{} 16988 @var{use variable} some_var # some_var should be local 16989 @dots{} # but is not by oversight 16990@} 16991@end example 16992 16993@cindex arrays, associative, library functions and 16994@cindex libraries of @command{awk} functions, associative arrays and 16995@cindex functions, library, associative arrays and 16996@cindex Tcl 16997A different convention, common in the Tcl community, is to use a single 16998associative array to hold the values needed by the library function(s), or 16999``package.'' This significantly decreases the number of actual global names 17000in use. For example, the functions described in 17001@ref{Passwd Functions}, 17002might have used array elements @code{@w{PW_data["inited"]}}, @code{@w{PW_data["total"]}}, 17003@code{@w{PW_data["count"]}}, and @code{@w{PW_data["awklib"]}}, instead of 17004@code{@w{_pw_inited}}, @code{@w{_pw_awklib}}, @code{@w{_pw_total}}, 17005and @code{@w{_pw_count}}. 17006 17007The conventions presented in this @value{SECTION} are exactly 17008that: conventions. You are not required to write your programs this 17009way---we merely recommend that you do so. 
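
The following fragment pulls these conventions together.  It is an
illustrative sketch only; the file @file{seq.awk} and every name in it
are hypothetical and are not part of the library presented in this
@value{CHAPTER}:

@example
# seq.awk --- hypothetical library file illustrating the conventions

# seq_next --- return the next number in a sequence
function seq_next(    n)     # n is local: declared as an extra parameter
@{
    n = ++_seq_count         # private global: preserves state between calls
    Seq_last = n             # public global: the main program may read it
    return n
@}
@end example

@noindent
A main program that happens to use its own variables named @code{n} or
@code{count} is unaffected by this library file.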
17010 17011@node General Functions 17012@section General Programming 17013 17014This @value{SECTION} presents a number of functions that are of general 17015programming use. 17016 17017@menu 17018* Nextfile Function:: Two implementations of a @code{nextfile} 17019 function. 17020* Assert Function:: A function for assertions in @command{awk} 17021 programs. 17022* Round Function:: A function for rounding if @code{sprintf} does 17023 not do it correctly. 17024* Cliff Random Function:: The Cliff Random Number Generator. 17025* Ordinal Functions:: Functions for using characters as numbers and 17026 vice versa. 17027* Join Function:: A function to join an array into a string. 17028* Gettimeofday Function:: A function to get formatted times. 17029@end menu 17030 17031@node Nextfile Function 17032@subsection Implementing @code{nextfile} as a Function 17033 17034@cindex input files, skipping 17035@c STARTOFRANGE libfnex 17036@cindex libraries of @command{awk} functions, @code{nextfile} statement 17037@c STARTOFRANGE flibnex 17038@cindex functions, library, @code{nextfile} statement 17039@c STARTOFRANGE nexim 17040@cindex @code{nextfile} statement, implementing 17041@cindex @command{gawk}, @code{nextfile} statement in 17042The @code{nextfile} statement, presented in 17043@ref{Nextfile Statement}, 17044is a @command{gawk}-specific extension---it is not available in most other 17045implementations of @command{awk}. This @value{SECTION} shows two versions of a 17046@code{nextfile} function that you can use to simulate @command{gawk}'s 17047@code{nextfile} statement if you cannot use @command{gawk}. 
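
For comparison with the functions that follow, here is a small sketch
of the built-in statement in use.  It prints only the first three
records of each input file; @file{file1} and @file{file2} are
placeholder names:

@example
gawk 'FNR > 3 @{ nextfile @}
            @{ print FILENAME ": " $0 @}' file1 file2
@end example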
17048 17049A first attempt at writing a @code{nextfile} function is as follows: 17050 17051@example 17052# nextfile --- skip remaining records in current file 17053# this should be read in before the "main" awk program 17054 17055function nextfile() @{ _abandon_ = FILENAME; next @} 17056_abandon_ == FILENAME @{ next @} 17057@end example 17058 17059@cindex programming conventions, @code{nextfile} statement 17060Because it supplies a rule that must be executed first, this file should 17061be included before the main program. This rule compares the current 17062@value{DF}'s name (which is always in the @code{FILENAME} variable) to 17063a private variable named @code{_abandon_}. If the @value{FN} matches, 17064then the action part of the rule executes a @code{next} statement to 17065go on to the next record. (The use of @samp{_} in the variable name is 17066a convention. It is discussed more fully in 17067@ref{Library Names}.) 17068 17069The use of the @code{next} statement effectively creates a loop that reads 17070all the records from the current @value{DF}. 17071The end of the file is eventually reached and 17072a new @value{DF} is opened, changing the value of @code{FILENAME}. 17073Once this happens, the comparison of @code{_abandon_} to @code{FILENAME} 17074fails, and execution continues with the first rule of the ``real'' program. 17075 17076The @code{nextfile} function itself simply sets the value of @code{_abandon_} 17077and then executes a @code{next} statement to start the 17078loop. 17079@ignore 17080@c If the function can't be used on other versions of awk, this whole 17081@c section is pointless, no? Sigh. 17082@footnote{@command{gawk} is the only known @command{awk} implementation 17083that allows you to 17084execute @code{next} from within a function body. Some other workaround 17085is necessary if you are not using @command{gawk}.} 17086@end ignore 17087 17088@cindex @code{nextfile} user-defined function 17089This initial version has a subtle problem. 
If the same @value{DF} is listed @emph{twice} on the command line,
one right after the other
or even with just a variable assignment between them,
this code skips right through the file a second time, even though
it should stop when it gets to the end of the first occurrence.
A second version of @code{nextfile} that remedies this problem
is shown here:

@example
@c file eg/lib/nextfile.awk
# nextfile --- skip remaining records in current file
# correctly handle successive occurrences of the same file
@c endfile
@ignore
@c file eg/lib/nextfile.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May, 1993

@c endfile
@end ignore
@c file eg/lib/nextfile.awk
# this should be read in before the "main" awk program

function nextfile()   @{ _abandon_ = FILENAME; next @}

_abandon_ == FILENAME @{
      if (FNR == 1)
          _abandon_ = ""
      else
          next
@}
@c endfile
@end example

The @code{nextfile} function has not changed. It makes @code{_abandon_}
equal to the current @value{FN} and then executes a @code{next} statement.
The @code{next} statement reads the next record and increments @code{FNR}
so that @code{FNR} is guaranteed to have a value of at least two.
However, if @code{nextfile} is called for the last record in the file,
then @command{awk} closes the current @value{DF} and moves on to the next
one. Upon doing so, @code{FILENAME} is set to the name of the new file
and @code{FNR} is reset to one. If this next file is the same as
the previous one, @code{_abandon_} is still equal to @code{FILENAME}.
However, @code{FNR} is equal to one, telling us that this is a new
occurrence of the file and not the one we were reading when the
@code{nextfile} function was executed.
In that case, @code{_abandon_}
is reset to the empty string, so that further executions of this rule
fail (until the next time that @code{nextfile} is called).

If @code{FNR} is not one, then we are still in the original @value{DF}
and the program executes a @code{next} statement to skip through it.

An important question to ask at this point is: given that the
functionality of @code{nextfile} can be provided with a library file,
why is it built into @command{gawk}? Adding
features for little reason leads to larger, slower programs that are
harder to maintain.
The answer is that building @code{nextfile} into @command{gawk} provides
significant gains in efficiency. If the @code{nextfile} function is executed
at the beginning of a large @value{DF}, @command{awk} still has to scan the entire
file, splitting it up into records,
@c at least conceptually
just to skip over it. The built-in
@code{nextfile} can simply close the file immediately and proceed to the
next one, which saves a lot of time. This is particularly important in
@command{awk}, because @command{awk} programs are generally I/O-bound (i.e.,
they spend most of their time doing input and output, instead of performing
computations).
@c ENDOFRANGE libfnex
@c ENDOFRANGE flibnex
@c ENDOFRANGE nexim

@node Assert Function
@subsection Assertions

@c STARTOFRANGE asse
@cindex assertions
@c STARTOFRANGE assef
@cindex @code{assert} function (C library)
@c STARTOFRANGE libfass
@cindex libraries of @command{awk} functions, assertions
@c STARTOFRANGE flibass
@cindex functions, library, assertions
@cindex @command{awk} programs, lengthy, assertions
When writing large programs, it is often useful to know
that a condition or set of conditions is true.
Before proceeding with a
particular computation, you make a statement about what you believe to be
the case. Such a statement is known as an
@dfn{assertion}. The C language provides an @code{<assert.h>} header file
and corresponding @code{assert} macro that the programmer can use to make
assertions. If an assertion fails, the @code{assert} macro arranges to
print a diagnostic message describing the condition that should have
been true but was not, and then it kills the program. In C, using
@code{assert} looks like this:

@example
#include <assert.h>

int myfunc(int a, double b)
@{
     assert(a <= 5 && b >= 17.1);
     @dots{}
@}
@end example

If the assertion fails, the program prints a message similar to this:

@example
prog.c:5: assertion failed: a <= 5 && b >= 17.1
@end example

@cindex @code{assert} user-defined function
The C language makes it possible to turn the condition into a string for use
in printing the diagnostic message. This is not possible in @command{awk}, so
this @code{assert} function also requires a string version of the condition
that is being tested.
Following is the function:

@example
@c file eg/lib/assert.awk
# assert --- assert that a condition is true. Otherwise exit.
@c endfile
@ignore
@c file eg/lib/assert.awk

#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May, 1993

@c endfile
@end ignore
@c file eg/lib/assert.awk
function assert(condition, string)
@{
    if (! condition) @{
        printf("%s:%d: assertion failed: %s\n",
            FILENAME, FNR, string) > "/dev/stderr"
        _assert_exit = 1
        exit 1
    @}
@}

@group
END @{
    if (_assert_exit)
        exit 1
@}
@end group
@c endfile
@end example

The @code{assert} function tests the @code{condition} parameter. If it
is false, it prints a message to standard error, using the @code{string}
parameter to describe the failed condition. It then sets the variable
@code{_assert_exit} to one and executes the @code{exit} statement.
The @code{exit} statement jumps to the @code{END} rule. If the @code{END}
rule finds @code{_assert_exit} to be true, it exits immediately.

The purpose of the test in the @code{END} rule is to
keep any other @code{END} rules from running. When an assertion fails, the
program should exit immediately.
If no assertions fail, then @code{_assert_exit} is still
false when the @code{END} rule is run normally, and the rest of the
program's @code{END} rules execute.
For all of this to work correctly, @file{assert.awk} must be the
first source file read by @command{awk}.
The function can be used in a program in the following way:

@example
function myfunc(a, b)
@{
     assert(a <= 5 && b >= 17.1, "a <= 5 && b >= 17.1")
     @dots{}
@}
@end example

@noindent
If the assertion fails, you see a message similar to the following:

@example
mydata:1357: assertion failed: a <= 5 && b >= 17.1
@end example

@cindex @code{END} pattern, @code{assert} user-defined function and
There is a small problem with this version of @code{assert}.
An @code{END} rule is automatically added
to the program calling @code{assert}. Normally, if a program consists
of just a @code{BEGIN} rule, the input files and/or standard input are
not read.
However, now that the program has an @code{END} rule, @command{awk}
attempts to read the input @value{DF}s or standard input
(@pxref{Using BEGIN/END}),
most likely causing the program to hang as it waits for input.

@cindex @code{BEGIN} pattern, @code{assert} user-defined function and
There is a simple workaround to this:
make sure the @code{BEGIN} rule always ends
with an @code{exit} statement.
@c ENDOFRANGE asse
@c ENDOFRANGE assef
@c ENDOFRANGE flibass
@c ENDOFRANGE libfass

@node Round Function
@subsection Rounding Numbers

@cindex rounding
@cindex rounding numbers
@cindex numbers, rounding
@cindex libraries of @command{awk} functions, rounding numbers
@cindex functions, library, rounding numbers
@cindex @code{print} statement, @code{sprintf} function and
@cindex @code{printf} statement, @code{sprintf} function and
@cindex @code{sprintf} function, @code{print}/@code{printf} statements and
The way @code{printf} and @code{sprintf}
(@pxref{Printf})
perform rounding often depends upon the system's C @code{sprintf}
subroutine. On many machines, @code{sprintf} rounding is ``unbiased,''
which means it doesn't always round a trailing @samp{.5} up, contrary
to naive expectations. In unbiased rounding, @samp{.5} rounds to even,
rather than always up, so 1.5 rounds to 2 but 4.5 rounds to 4. This means
that if you are using a format that does rounding (e.g., @code{"%.0f"}),
you should check what your system does.
The following function does
traditional rounding; it might be useful if your @command{awk}'s @code{printf}
does unbiased rounding:

@cindex @code{round} user-defined function
@example
@c file eg/lib/round.awk
# round.awk --- do normal rounding
@c endfile
@ignore
@c file eg/lib/round.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# August, 1996

@c endfile
@end ignore
@c file eg/lib/round.awk
function round(x,   ival, aval, fraction)
@{
   ival = int(x)    # integer part, int() truncates

   # see if fractional part
   if (ival == x)   # no fraction
      return x

   if (x < 0) @{
      aval = -x     # absolute value
      ival = int(aval)
      fraction = aval - ival
      if (fraction >= .5)
         return int(x) - 1   # -2.5 --> -3
      else
         return int(x)       # -2.3 --> -2
   @} else @{
      fraction = x - ival
      if (fraction >= .5)
         return ival + 1
      else
         return ival
   @}
@}

# test harness
@{ print $0, round($0) @}
@c endfile
@end example

@node Cliff Random Function
@subsection The Cliff Random Number Generator
@cindex random numbers, Cliff
@cindex Cliff random numbers
@cindex numbers, Cliff random
@cindex functions, library, Cliff random numbers

The Cliff random number
generator@footnote{@uref{http://mathworld.wolfram.com/CliffRandomNumberGenerator.html}}
is a very simple random number generator that ``passes the noise sphere test
for randomness by showing no structure.''
It is easily programmed, in fewer than 10 lines of @command{awk} code:

@cindex @code{cliff_rand} user-defined function
@example
@c file eg/lib/cliff_rand.awk
# cliff_rand.awk --- generate Cliff random numbers
@c endfile
@ignore
@c file eg/lib/cliff_rand.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# December 2000

@c endfile
@end ignore
@c file eg/lib/cliff_rand.awk
BEGIN @{ _cliff_seed = 0.1 @}

function cliff_rand()
@{
    _cliff_seed = (100 * log(_cliff_seed)) % 1
    if (_cliff_seed < 0)
        _cliff_seed = - _cliff_seed
    return _cliff_seed
@}
@c endfile
@end example

This algorithm requires an initial ``seed'' of 0.1. Each new value
uses the current seed as input for the calculation.
If the built-in @code{rand} function
(@pxref{Numeric Functions})
isn't random enough, you might try using this function instead.

@node Ordinal Functions
@subsection Translating Between Characters and Numbers

@cindex libraries of @command{awk} functions, character values as numbers
@cindex functions, library, character values as numbers
@cindex characters, values of as numbers
@cindex numbers, as values of characters
One commercial implementation of @command{awk} supplies a built-in function,
@code{ord}, which takes a character and returns the numeric value for that
character in the machine's character set. If the string passed to
@code{ord} has more than one character, only the first one is used.

The inverse of this function is @code{chr} (from the function of the same
name in Pascal), which takes a number and returns the corresponding character.
Both functions are written very nicely in @command{awk}; there is no real
reason to build them into the @command{awk} interpreter:

@cindex @code{ord} user-defined function
@cindex @code{chr} user-defined function
@example
@c file eg/lib/ord.awk
# ord.awk --- do ord and chr

# Global identifiers:
#    _ord_:        numerical values indexed by characters
#    _ord_init:    function to initialize _ord_
@c endfile
@ignore
@c file eg/lib/ord.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# 16 January, 1992
# 20 July, 1992, revised

@c endfile
@end ignore
@c file eg/lib/ord.awk
BEGIN    @{ _ord_init() @}

function _ord_init(    low, high, i, t)
@{
    low = sprintf("%c", 7) # BEL is ascii 7
    if (low == "\a") @{    # regular ascii
        low = 0
        high = 127
    @} else if (sprintf("%c", 128 + 7) == "\a") @{
        # ascii, mark parity
        low = 128
        high = 255
    @} else @{        # ebcdic(!)
        low = 0
        high = 255
    @}

    for (i = low; i <= high; i++) @{
        t = sprintf("%c", i)
        _ord_[t] = i
    @}
@}
@c endfile
@end example

@cindex character sets
@cindex character encodings
@cindex ASCII
@cindex EBCDIC
@cindex mark parity
Some explanation of the numbers used by @code{chr} is worthwhile.
The most prominent character set in use today is ASCII. Although an
8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only
defines characters that use the values from 0 to 127.@footnote{ASCII
has been extended in many countries to use the values from 128 to 255
for country-specific characters.
If your system uses these extensions,
you can simplify @code{_ord_init} to simply loop from 0 to 255.}
In the now distant past,
at least one minicomputer manufacturer
@c Pr1me, blech
used ASCII, but with mark parity, meaning that the leftmost bit in the byte
is always 1. This means that on those systems, characters
have numeric values from 128 to 255.
Finally, large mainframe systems use the EBCDIC character set, which
uses all 256 values.
While there are other character sets in use on some older systems,
they are not really worth worrying about:

@example
@c file eg/lib/ord.awk
function ord(str,    c)
@{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
@}

function chr(c)
@{
    # force c to be numeric by adding 0
    return sprintf("%c", c + 0)
@}
@c endfile

#### test code ####
# BEGIN    \
# @{
#    for (;;) @{
#        printf("enter a character: ")
#        if (getline var <= 0)
#            break
#        printf("ord(%s) = %d\n", var, ord(var))
#    @}
# @}
@c endfile
@end example

An obvious improvement to these functions is to move the code for the
@code{@w{_ord_init}} function into the body of the @code{BEGIN} rule. It was
written this way initially for ease of development.
There is a ``test program'' in a @code{BEGIN} rule, to test the
function. It is commented out for production use.

@node Join Function
@subsection Merging an Array into a String

@cindex libraries of @command{awk} functions, merging arrays into strings
@cindex functions, library, merging arrays into strings
@cindex strings, merging arrays into
@cindex arrays, merging into strings
When doing string processing, it is often useful to be able to join
all the strings in an array into one long string.
The following function,
@code{join}, accomplishes this task. It is used later in several of
the application programs
(@pxref{Sample Programs}).

Good function design is important; this function needs to be general but it
should also have a reasonable default behavior. It is called with an array
as well as the beginning and ending indices of the elements in the array to be
merged. This assumes that the array indices are numeric---a reasonable
assumption since the array was likely created with @code{split}
(@pxref{String Functions}):

@cindex @code{join} user-defined function
@example
@c file eg/lib/join.awk
# join.awk --- join an array into a string
@c endfile
@ignore
@c file eg/lib/join.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

@c endfile
@end ignore
@c file eg/lib/join.awk
function join(array, start, end, sep,    result, i)
@{
    if (sep == "")
       sep = " "
    else if (sep == SUBSEP) # magic value
       sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
@}
@c endfile
@end example

An optional additional argument is the separator to use when joining the
strings back together. If the caller supplies a nonempty value,
@code{join} uses it; if it is not supplied, it has a null
value. In this case, @code{join} uses a single blank as a default
separator for the strings. If the value is equal to @code{SUBSEP},
then @code{join} joins the strings with no separator between them.
@code{SUBSEP} serves as a ``magic'' value to indicate that there should
be no separation between the component strings.@footnote{It would
be nice if @command{awk} had an assignment operator for concatenation.
The lack of an explicit operator for concatenation makes string operations
more difficult than they really need to be.}

@node Gettimeofday Function
@subsection Managing the Time of Day

@cindex libraries of @command{awk} functions, managing, time
@cindex functions, library, managing time
@cindex timestamps, formatted
@cindex time, managing
The @code{systime} and @code{strftime} functions described in
@ref{Time Functions},
provide the minimum functionality necessary for dealing with the time of day
in human readable form. While @code{strftime} is extensive, the control
formats are not necessarily easy to remember or intuitively obvious when
reading a program.

The following function, @code{gettimeofday}, populates a user-supplied array
with preformatted time information. It returns a string with the current
time formatted in the same way as the @command{date} utility:

@cindex @code{gettimeofday} user-defined function
@example
@c file eg/lib/gettime.awk
# gettimeofday.awk --- get the time of day in a usable format
@c endfile
@ignore
@c file eg/lib/gettime.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain, May 1993
#
@c endfile
@end ignore
@c file eg/lib/gettime.awk

# Returns a string in the format of output of date(1)
# Populates the array argument time with individual values:
#    time["second"]       -- seconds (0 - 59)
#    time["minute"]       -- minutes (0 - 59)
#    time["hour"]         -- hours (0 - 23)
#    time["althour"]      -- hours (1 - 12)
#    time["monthday"]     -- day of month (1 - 31)
#    time["month"]        -- month of year (1 - 12)
#    time["monthname"]    -- name of the month
#    time["shortmonth"]   -- short name of the month
#    time["year"]         -- year modulo 100 (0 - 99)
#    time["fullyear"]     -- full year
#    time["weekday"]      -- day of week (Sunday = 0)
#    time["altweekday"]   -- day of week (Monday = 1)
#    time["dayname"]      -- name of weekday
#    time["shortdayname"] -- short name of weekday
#    time["yearday"]      -- day of year (1 - 366)
#    time["timezone"]     -- abbreviation of timezone name
#    time["ampm"]         -- AM or PM designation
#    time["weeknum"]      -- week number, Sunday first day
#    time["altweeknum"]   -- week number, Monday first day

function gettimeofday(time,    ret, now, i)
@{
    # get time once, avoids unnecessary system calls
    now = systime()

    # return date(1)-style output
    ret = strftime("%a %b %d %H:%M:%S %Z %Y", now)

    # clear out target array
    delete time

    # fill in values, force numeric values to be
    # numeric by adding 0
    time["second"]       = strftime("%S", now) + 0
    time["minute"]       = strftime("%M", now) + 0
    time["hour"]         = strftime("%H", now) + 0
    time["althour"]      = strftime("%I", now) + 0
    time["monthday"]     = strftime("%d", now) + 0
    time["month"]        = strftime("%m", now) + 0
    time["monthname"]    = strftime("%B", now)
    time["shortmonth"]   = strftime("%b", now)
    time["year"]         = strftime("%y", now) + 0
    time["fullyear"]     = strftime("%Y", now) + 0
    time["weekday"]      = strftime("%w", now) + 0
    time["altweekday"]   = strftime("%u", now) + 0
    time["dayname"]      = strftime("%A", now)
    time["shortdayname"] = strftime("%a", now)
    time["yearday"]      = strftime("%j", now) + 0
    time["timezone"]     = strftime("%Z", now)
    time["ampm"]         = strftime("%p", now)
    time["weeknum"]      = strftime("%U", now) + 0
    time["altweeknum"]   = strftime("%W", now) + 0

    return ret
@}
@c endfile
@end example

The string indices are easier to use and read than the various formats
required by @code{strftime}. The @code{alarm} program presented in
@ref{Alarm Program},
uses this function.
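
As a usage illustration only, the following sketch shows how a caller
might consume the array that @code{gettimeofday} fills in. The file name
@file{greet.awk} and the variable names @code{now} and @code{stamp} are
hypothetical. The sketch assumes @command{gawk}, because
@code{gettimeofday} relies on @code{systime} and @code{strftime}, and it
assumes that the library file is available as @file{gettime.awk} in the
current directory:

@example
# greet.awk --- hypothetical caller of gettimeofday()
# run as: gawk -f gettime.awk -f greet.awk

BEGIN @{
    # fills now[] and returns the date(1)-style string
    stamp = gettimeofday(now)

    if (now["hour"] < 12)
        print "Good morning"
    else
        print "Good afternoon or evening"

    printf("%s, %s %d, %d\n", now["dayname"], now["monthname"],
           now["monthday"], now["fullyear"])
    print stamp
@}
@end example

@noindent
The caller never has to spell out a @code{strftime} format; it only
indexes the array by name.
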
A more general design for the @code{gettimeofday} function would have
allowed the user to supply an optional timestamp value to use instead
of the current time.

@node Data File Management
@section @value{DDF} Management

@c STARTOFRANGE dataf
@cindex files, managing
@c STARTOFRANGE libfdataf
@cindex libraries of @command{awk} functions, managing, @value{DF}s
@c STARTOFRANGE flibdataf
@cindex functions, library, managing @value{DF}s
This @value{SECTION} presents functions that are useful for managing
command-line @value{DF}s.

@menu
* Filetrans Function::          A function for handling data file transitions.
* Rewind Function::             A function for rereading the current file.
* File Checking::               Checking that data files are readable.
* Empty Files::                 Checking for zero-length files.
* Ignoring Assigns::            Treating assignments as file names.
@end menu

@node Filetrans Function
@subsection Noting @value{DDF} Boundaries

@cindex files, managing, @value{DF} boundaries
@cindex files, initialization and cleanup
The @code{BEGIN} and @code{END} rules are each executed exactly once at
the beginning and end of your @command{awk} program, respectively
(@pxref{BEGIN/END}).
We (the @command{gawk} authors) once had a user who mistakenly thought that the
@code{BEGIN} rule is executed at the beginning of each @value{DF} and the
@code{END} rule is executed at the end of each @value{DF}. When informed
that this was not the case, the user requested that we add new special
patterns to @command{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that
would have the desired behavior. He even supplied us the code to do so.

Adding these special patterns to @command{gawk} wasn't necessary;
the job can be done cleanly in @command{awk} itself, as illustrated
by the following library program.
It arranges to call two user-supplied functions, @code{beginfile} and
@code{endfile}, at the beginning and end of each @value{DF}.
Besides solving the problem in only nine(!) lines of code, it does so
@emph{portably}; this works with any implementation of @command{awk}:

@example
# transfile.awk
#
# Give the user a hook for filename transitions
#
# The user must supply functions beginfile() and endfile()
# that each take the name of the file being started or
# finished, respectively.
@c #
@c # Arnold Robbins, arnold@@gnu.org, Public Domain
@c # January 1992

FILENAME != _oldfilename \
@{
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
@}

END   @{ endfile(FILENAME) @}
@end example

This file must be loaded before the user's ``main'' program, so that the
rule it supplies is executed first.

This rule relies on @command{awk}'s @code{FILENAME} variable that
automatically changes for each new @value{DF}. The current @value{FN} is
saved in a private variable, @code{_oldfilename}. If @code{FILENAME} does
not equal @code{_oldfilename}, then a new @value{DF} is being processed and
it is necessary to call @code{endfile} for the old file. Because
@code{endfile} should only be called if a file has been processed, the
program first checks to make sure that @code{_oldfilename} is not the null
string. The program then assigns the current @value{FN} to
@code{_oldfilename} and calls @code{beginfile} for the file.
Because, like all @command{awk} variables, @code{_oldfilename} is
initialized to the null string, this rule executes correctly even for the
first @value{DF}.

The program also supplies an @code{END} rule to do the final processing for
the last file.
Because this @code{END} rule comes before any @code{END} rules
supplied in the ``main'' program, @code{endfile} is called first. Once
again the value of multiple @code{BEGIN} and @code{END} rules should be clear.

@cindex @code{beginfile} user-defined function
@cindex @code{endfile} user-defined function
This version has the same problem as the first version of @code{nextfile}
(@pxref{Nextfile Function}).
If the same @value{DF} occurs twice in a row on the command line, then
@code{endfile} and @code{beginfile} are not executed at the end of the
first pass and at the beginning of the second pass.
The following version solves the problem:

@example
@c file eg/lib/ftrans.awk
# ftrans.awk --- handle data file transitions
#
# user supplies beginfile() and endfile() functions
@c endfile
@ignore
@c file eg/lib/ftrans.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# November 1992

@c endfile
@end ignore
@c file eg/lib/ftrans.awk
FNR == 1 @{
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
@}

END  @{ endfile(_filename_) @}
@c endfile
@end example

@ref{Wc Program},
shows how this library function can be used and
how it simplifies writing the main program.

@node Rewind Function
@subsection Rereading the Current File

@cindex files, reading
Another request for a new built-in function was for a @code{rewind}
function that would make it possible to reread the current file.
The requesting user didn't want to have to use @code{getline}
(@pxref{Getline})
inside a loop.

However, as long as you are not in the @code{END} rule, it is
quite easy to arrange to immediately close the current input file
and then start over with it from the top.
For lack of a better name, we'll call it @code{rewind}:

@cindex @code{rewind} user-defined function
@example
@c file eg/lib/rewind.awk
# rewind.awk --- rewind the current file and start over
@c endfile
@ignore
@c file eg/lib/rewind.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# September 2000

@c endfile
@end ignore
@c file eg/lib/rewind.awk
function rewind(    i)
@{
    # shift remaining arguments up
    for (i = ARGC; i > ARGIND; i--)
        ARGV[i] = ARGV[i-1]

    # make sure gawk knows to keep going
    ARGC++

    # make current file next to get done
    ARGV[ARGIND+1] = FILENAME

    # do it
    nextfile
@}
@c endfile
@end example

This code relies on the @code{ARGIND} variable
(@pxref{Auto-set}),
which is specific to @command{gawk}.
If you are not using
@command{gawk}, you can use ideas presented in
@ifnotinfo
the previous @value{SECTION}
@end ifnotinfo
@ifinfo
@ref{Filetrans Function},
@end ifinfo
to either update @code{ARGIND} on your own
or modify this code as appropriate.

The @code{rewind} function also relies on the @code{nextfile} keyword
(@pxref{Nextfile Statement}).
@xref{Nextfile Function},
for a function version of @code{nextfile}.

@node File Checking
@subsection Checking for Readable @value{DDF}s

@cindex troubleshooting, readable @value{DF}s
@c comma is part of primary
@cindex readable @value{DF}s, checking
@cindex files, skipping
Normally, if you give @command{awk} a @value{DF} that isn't readable,
it stops with a fatal error. There are times when you
might want to just ignore such files and keep going.
You can
do this by prepending the following program to your @command{awk}
program:

@cindex @code{readable.awk} program
@example
@c file eg/lib/readable.awk
# readable.awk --- library file to skip over unreadable files
@c endfile
@ignore
@c file eg/lib/readable.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# October 2000

@c endfile
@end ignore
@c file eg/lib/readable.awk
BEGIN @{
    for (i = 1; i < ARGC; i++) @{
        if (ARGV[i] ~ /^[A-Za-z_][A-Za-z0-9_]*=.*/ \
            || ARGV[i] == "-")
            continue    # assignment or standard input
        else if ((getline junk < ARGV[i]) < 0) # unreadable
            delete ARGV[i]
        else
            close(ARGV[i])
    @}
@}
@c endfile
@end example

@cindex troubleshooting, @code{getline} function
In @command{gawk}, the @code{getline} won't be fatal (unless
@option{--posix} is in force).
Removing the element from @code{ARGV} with @code{delete}
skips the file (since it's no longer in the list).

@c This doesn't handle /dev/stdin etc. Not worth the hassle to mention or fix.

@node Empty Files
@subsection Checking For Zero-length Files

All known @command{awk} implementations silently skip over zero-length files.
This is a by-product of @command{awk}'s implicit
read-a-record-and-match-against-the-rules loop: when @command{awk}
tries to read a record from an empty file, it immediately receives an
end of file indication, closes the file, and proceeds on to the next
command-line @value{DF}, @emph{without} executing any user-level
@command{awk} program code.

Using @command{gawk}'s @code{ARGIND} variable
(@pxref{Built-in Variables}), it is possible to detect when an empty
@value{DF} has been skipped.
Similar to the library file presented
in @ref{Filetrans Function}, the following library file calls a function named
@code{zerofile} that the user must provide. The arguments passed are
the @value{FN} and the position in @code{ARGV} where it was found:

@cindex @code{zerofile.awk} program
@example
@c file eg/lib/zerofile.awk
# zerofile.awk --- library file to process empty input files
@c endfile
@ignore
@c file eg/lib/zerofile.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# June 2003

@c endfile
@end ignore
@c file eg/lib/zerofile.awk
BEGIN @{ Argind = 0 @}

ARGIND > Argind + 1 @{
    for (Argind++; Argind < ARGIND; Argind++)
        zerofile(ARGV[Argind], Argind)
@}

ARGIND != Argind @{ Argind = ARGIND @}

END @{
    if (ARGIND > Argind)
        for (Argind++; Argind <= ARGIND; Argind++)
            zerofile(ARGV[Argind], Argind)
@}
@c endfile
@end example

The user-level variable @code{Argind} allows the @command{awk} program
to track its progress through @code{ARGV}. Whenever the program detects
that @code{ARGIND} is greater than @samp{Argind + 1}, it means that one or
more empty files were skipped. The action then calls @code{zerofile} for
each such file, incrementing @code{Argind} along the way.

The @samp{ARGIND != Argind} rule simply keeps @code{Argind} up to date
in the normal case.

Finally, the @code{END} rule catches the case of any empty files at
the end of the command-line arguments. Note that the test in the
condition of the @code{for} loop uses the @samp{<=} operator,
not @code{<}.

As an exercise, you might consider whether this same problem can
be solved without relying on @command{gawk}'s @code{ARGIND} variable.
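
For instance, a minimal caller of the @file{zerofile.awk} library file
might look like the following sketch. The file name @file{warn.awk}, the
parameter names, and the message text are hypothetical; only the
two-argument @code{zerofile} interface comes from the library file:

@example
# warn.awk --- hypothetical zerofile() that reports empty files
# run as: gawk -f zerofile.awk -f warn.awk file1 file2 ...

function zerofile(name, pos)
@{
    printf("%s: argument %d is an empty file\n",
           name, pos) > "/dev/stderr"
@}
@end example

@noindent
Each empty file that @command{gawk} skips then produces one line on
standard error, naming the file and its position in @code{ARGV}.
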

As a second exercise, revise this code to handle the case where
an intervening value in @code{ARGV} is a variable assignment.

@ignore
# zerofile2.awk --- same thing, portably
BEGIN @{
    ARGIND = Argind = 0
    for (i = 1; i < ARGC; i++)
        Fnames[ARGV[i]]++

@}
FNR == 1 @{
    while (ARGV[ARGIND] != FILENAME)
        ARGIND++
    Seen[FILENAME]++
    if (Seen[FILENAME] == Fnames[FILENAME])
        do
            ARGIND++
        while (ARGV[ARGIND] != FILENAME)
@}
ARGIND > Argind + 1 @{
    for (Argind++; Argind < ARGIND; Argind++)
        zerofile(ARGV[Argind], Argind)
@}
ARGIND != Argind @{
    Argind = ARGIND
@}
END @{
    if (ARGIND < ARGC - 1)
        ARGIND = ARGC - 1
    if (ARGIND > Argind)
        for (Argind++; Argind <= ARGIND; Argind++)
            zerofile(ARGV[Argind], Argind)
@}
@end ignore

@node Ignoring Assigns
@subsection Treating Assignments as @value{FFN}s

@cindex assignments as filenames
@cindex filenames, assignments as
Occasionally, you might not want @command{awk} to process command-line
variable assignments
(@pxref{Assignment Options}).
In particular, if you have @value{FN}s that contain an @samp{=} character,
@command{awk} treats the @value{FN} as an assignment, and does not process it.

Some users have suggested an additional command-line option for @command{gawk}
to disable command-line assignments.
However, some simple programming with
a library file does the trick:

@cindex @code{noassign.awk} program
@example
@c file eg/lib/noassign.awk
# noassign.awk --- library file to avoid the need for a
# special option that disables command-line assignments
@c endfile
@ignore
@c file eg/lib/noassign.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# October 1999

@c endfile
@end ignore
@c file eg/lib/noassign.awk
function disable_assigns(argc, argv,    i)
@{
    for (i = 1; i < argc; i++)
        if (argv[i] ~ /^[A-Za-z_][A-Za-z_0-9]*=.*/)
            argv[i] = ("./" argv[i])
@}

BEGIN @{
    if (No_command_assign)
        disable_assigns(ARGC, ARGV)
@}
@c endfile
@end example

You then run your program this way:

@example
awk -v No_command_assign=1 -f noassign.awk -f yourprog.awk *
@end example

The function works by looping through the arguments.
It prepends @samp{./} to
any argument that matches the form
of a variable assignment, turning that argument into a @value{FN}.

The use of @code{No_command_assign} allows you to disable command-line
assignments at invocation time, by giving the variable a true value.
When not set, it is initially zero (i.e., false), so the command-line arguments
are left alone.
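
To see which arguments the function rewrites, the same test can be
tried on a few sample strings.  The following stand-alone sketch is not
part of @file{noassign.awk}; the sample arguments are made up for
illustration:

@example
BEGIN @{
    # Made-up arguments; only the first and third look like assignments
    n = split("a=b.txt data.txt x1=2 =foo", args, " ")
    for (i = 1; i <= n; i++) @{
        arg = args[i]
        if (arg ~ /^[A-Za-z_][A-Za-z_0-9]*=.*/)
            arg = "./" arg      # now treated as a file name
        print arg
    @}
@}
@end example

Here @samp{a=b.txt} and @samp{x1=2} gain the @samp{./} prefix, while
@samp{data.txt} and @samp{=foo} are left alone, because neither begins
with a valid variable name followed by @samp{=}.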
@c ENDOFRANGE dataf
@c ENDOFRANGE flibdataf
@c ENDOFRANGE libfdataf

@node Getopt Function
@section Processing Command-Line Options

@c STARTOFRANGE libfclo
@cindex libraries of @command{awk} functions, command-line options
@c STARTOFRANGE flibclo
@cindex functions, library, command-line options
@c STARTOFRANGE clop
@cindex command-line options, processing
@c STARTOFRANGE oclp
@cindex options, command-line, processing
@c STARTOFRANGE clibf
@cindex functions, library, C library
@cindex arguments, processing
Most utilities on POSIX compatible systems take options, or ``switches,'' on
the command line that can be used to change the way a program behaves.
@command{awk} is an example of such a program
(@pxref{Options}).
Often, options take @dfn{arguments}; i.e., data that the program needs to
correctly obey the command-line option.  For example, @command{awk}'s
@option{-F} option requires a string to use as the field separator.
The first occurrence on the command line of either @option{--} or a
string that does not begin with @samp{-} ends the options.

@cindex @code{getopt} function (C library)
Modern Unix systems provide a C function named @code{getopt} for processing
command-line arguments.  The programmer provides a string describing the
one-letter options.  If an option requires an argument, it is followed in the
string with a colon.  @code{getopt} is also passed the
count and values of the command-line arguments and is called in a loop.
@code{getopt} processes the command-line arguments for option letters.
Each time around the loop, it returns a single character representing the
next option letter that it finds, or @samp{?} if it finds an invalid option.
When it returns @minus{}1, there are no options left on the command line.

When using @code{getopt}, options that do not take arguments can be
grouped together.  Furthermore, options that take arguments require that the
argument is present.  The argument can immediately follow the option letter,
or it can be a separate command-line argument.

Given a hypothetical program that takes
three command-line options, @option{-a}, @option{-b}, and @option{-c}, where
@option{-b} requires an argument, all of the following are valid ways of
invoking the program:

@example
prog -a -b foo -c data1 data2 data3
prog -ac -bfoo -- data1 data2 data3
prog -acbfoo data1 data2 data3
@end example

Notice that when the argument is grouped with its option, the rest of
the argument is considered to be the option's argument.
In this example, @option{-acbfoo} indicates that all of the
@option{-a}, @option{-b}, and @option{-c} options were supplied,
and that @samp{foo} is the argument to the @option{-b} option.

@code{getopt} provides four external variables that the programmer can use:

@table @code
@item optind
The index in the argument value array (@code{argv}) where the first
nonoption command-line argument can be found.

@item optarg
The string value of the argument to an option.

@item opterr
Usually @code{getopt} prints an error message when it finds an invalid
option.  Setting @code{opterr} to zero disables this feature.  (An
application might want to print its own error message.)

@item optopt
The letter representing the command-line option.
@c While not usually documented, most versions supply this variable.
@end table

The following C fragment shows how @code{getopt} might process command-line
arguments for @command{awk}:

@example
int
main(int argc, char *argv[])
@{
    @dots{}
    /* print our own message */
    opterr = 0;
    while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{
        switch (c) @{
        case 'f':    /* file */
            @dots{}
            break;
        case 'F':    /* field separator */
            @dots{}
            break;
        case 'v':    /* variable assignment */
            @dots{}
            break;
        case 'W':    /* extension */
            @dots{}
            break;
        case '?':
        default:
            usage();
            break;
        @}
    @}
    @dots{}
@}
@end example

As a side point, @command{gawk} actually uses the GNU @code{getopt_long}
function to process both normal and GNU-style long options
(@pxref{Options}).

The abstraction provided by @code{getopt} is very useful and is quite
handy in @command{awk} programs as well.  Following is an @command{awk}
version of @code{getopt}.  This function highlights one of the
greatest weaknesses in @command{awk}, which is that it is very poor at
manipulating single characters.  Repeated calls to @code{substr} are
necessary for accessing individual characters
(@pxref{String Functions}).@footnote{This
function was written before @command{gawk} acquired the ability to
split strings into single characters using @code{""} as the separator.
We have left it alone, since using @code{substr} is more portable.}

The discussion that follows walks through the code a bit at a time:

@cindex @code{getopt} user-defined function
@example
@c file eg/lib/getopt.awk
# getopt.awk --- do C library getopt(3) function in awk
@c endfile
@ignore
@c file eg/lib/getopt.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
#
# Initial version: March, 1991
# Revised: May, 1993

@c endfile
@end ignore
@c file eg/lib/getopt.awk
# External variables:
#    Optind -- index in ARGV of first nonoption argument
#    Optarg -- string value of argument to current option
#    Opterr -- if nonzero, print our own diagnostic
#    Optopt -- current option letter

# Returns:
#    -1     at end of options
#    ?      for unrecognized option
#    <c>    a character representing the current option

# Private Data:
#    _opti  -- index in multi-flag option, e.g., -abc
@c endfile
@end example

The function starts out with
a list of the global variables it uses,
what the return values are, what they mean, and any global variables that
are ``private'' to this library function.  Such documentation is essential
for any program, and particularly for library functions.

The @code{getopt} function first checks that it was indeed called with a string of options
(the @code{options} parameter).
If @code{options} has a zero length,
@code{getopt} immediately returns @minus{}1:

@cindex @code{getopt} user-defined function
@example
@c file eg/lib/getopt.awk
function getopt(argc, argv, options,    thisopt, i)
@{
    if (length(options) == 0)    # no options given
        return -1

@group
    if (argv[Optind] == "--") @{  # all done
        Optind++
        _opti = 0
        return -1
@end group
    @} else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) @{
        _opti = 0
        return -1
    @}
@c endfile
@end example

The next thing to check for is the end of the options.  A @option{--}
ends the command-line options, as does any command-line argument that
does not begin with a @samp{-}.  @code{Optind} is used to step through
the array of command-line arguments; it retains its value across calls
to @code{getopt}, because it is a global variable.

The regular expression that is used, @code{@w{/^-[^: \t\n\f\r\v\b]/}}, is
perhaps a bit of overkill; it checks for a @samp{-} followed by anything
that is not whitespace and not a colon.
If the current command-line argument does not match this pattern,
it is not an option, and it ends option processing:

@example
@c file eg/lib/getopt.awk
    if (_opti == 0)
        _opti = 2
    thisopt = substr(argv[Optind], _opti, 1)
    Optopt = thisopt
    i = index(options, thisopt)
    if (i == 0) @{
        if (Opterr)
            printf("%c -- invalid option\n",
                                  thisopt) > "/dev/stderr"
        if (_opti >= length(argv[Optind])) @{
            Optind++
            _opti = 0
        @} else
            _opti++
        return "?"
    @}
@c endfile
@end example

The @code{_opti} variable tracks the position in the current command-line
argument (@code{argv[Optind]}).
If multiple options are
grouped together with one @samp{-} (e.g., @option{-abx}), it is necessary
to return them to the user one at a time.

If @code{_opti} is equal to zero, it is set to two, which is the index in
the string of the next character to look at (we skip the @samp{-}, which
is at position one).  The variable @code{thisopt} holds the character,
obtained with @code{substr}.  It is saved in @code{Optopt} for the main
program to use.

If @code{thisopt} is not in the @code{options} string, then it is an
invalid option.  If @code{Opterr} is nonzero, @code{getopt} prints an error
message on the standard error that is similar to the message from the C
version of @code{getopt}.

Because the option is invalid, it is necessary to skip it and move on to the
next option character.  If @code{_opti} is greater than or equal to the
length of the current command-line argument, it is necessary to move on
to the next argument, so @code{Optind} is incremented and @code{_opti} is reset
to zero.  Otherwise, @code{Optind} is left alone and @code{_opti} is merely
incremented.

In any case, because the option is invalid, @code{getopt} returns @samp{?}.
The main program can examine @code{Optopt} if it needs to know what the
invalid option letter actually is.  Continuing on:

@example
@c file eg/lib/getopt.awk
    if (substr(options, i + 1, 1) == ":") @{
        # get option argument
        if (length(substr(argv[Optind], _opti + 1)) > 0)
            Optarg = substr(argv[Optind], _opti + 1)
        else
            Optarg = argv[++Optind]
        _opti = 0
    @} else
        Optarg = ""
@c endfile
@end example

If the option requires an argument, the option letter is followed by a colon
in the @code{options} string.
If there are remaining characters in the
current command-line argument (@code{argv[Optind]}), then the rest of that
string is assigned to @code{Optarg}.  Otherwise, the next command-line
argument is used (@samp{-xFOO} versus @samp{@w{-x FOO}}).  In either case,
@code{_opti} is reset to zero, because there are no more characters left to
examine in the current command-line argument.  Continuing:

@example
@c file eg/lib/getopt.awk
    if (_opti == 0 || _opti >= length(argv[Optind])) @{
        Optind++
        _opti = 0
    @} else
        _opti++
    return thisopt
@}
@c endfile
@end example

Finally, if @code{_opti} is either zero or greater than or equal to the
length of the current command-line argument, it means this element in
@code{argv} is through being processed, so @code{Optind} is incremented to
point to the next element in @code{argv}.  If neither condition is true, then only
@code{_opti} is incremented, so that the next option letter can be processed
on the next call to @code{getopt}.

The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.
@code{Opterr} is set to one, since the default behavior is for @code{getopt}
to print a diagnostic message upon seeing an invalid option.
@code{Optind}
is set to one, since there's no reason to look at the program name, which is
in @code{ARGV[0]}:

@example
@c file eg/lib/getopt.awk
BEGIN @{
    Opterr = 1    # default is to diagnose
    Optind = 1    # skip ARGV[0]

    # test program
    if (_getopt_test) @{
        while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
            printf("c = <%c>, optarg = <%s>\n",
                                       _go_c, Optarg)
        printf("non-option arguments:\n")
        for (; Optind < ARGC; Optind++)
            printf("\tARGV[%d] = <%s>\n",
                                    Optind, ARGV[Optind])
    @}
@}
@c endfile
@end example

The rest of the @code{BEGIN} rule is a simple test program.  Here is the
result of two sample runs of the test program:

@example
$ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
@print{} c = <a>, optarg = <>
@print{} c = <c>, optarg = <>
@print{} c = <b>, optarg = <ARG>
@print{} non-option arguments:
@print{}         ARGV[3] = <bax>
@print{}         ARGV[4] = <-x>

$ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
@print{} c = <a>, optarg = <>
@error{} x -- invalid option
@print{} c = <?>, optarg = <>
@print{} non-option arguments:
@print{}         ARGV[4] = <xyz>
@print{}         ARGV[5] = <abc>
@end example

In both runs,
the first @option{--} terminates the arguments to @command{awk}, so that it does
not try to interpret the @option{-a}, etc., as its own options.
Several of the sample programs presented in
@ref{Sample Programs},
use @code{getopt} to process their arguments.
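
A calling program typically loops over @code{getopt} in its own
@code{BEGIN} rule.  The following is a minimal sketch, assuming a
hypothetical program with a @option{-v} flag and a @option{-o} option
that takes an argument; the variable names @code{verbose} and
@code{outfile} are illustrative only:

@example
# Assumes getopt.awk is named first on the command line, so that
# its BEGIN rule has already set Opterr and Optind.
BEGIN @{
    while ((c = getopt(ARGC, ARGV, "vo:")) != -1) @{
        if (c == "v")
            verbose = 1
        else if (c == "o")
            outfile = Optarg
        else @{    # getopt returned "?"
            print "usage: prog [-v] [-o file] [file ...]" > "/dev/stderr"
            exit 1
        @}
    @}
    # Clear the processed options so that awk does not
    # try to open them as data files.
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""
@}
@end example

Such a program could be run as
@samp{awk -f getopt.awk -f prog.awk -- -v -o out.txt data1}, where
@file{prog.awk} is a placeholder name.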
@c ENDOFRANGE libfclo
@c ENDOFRANGE flibclo
@c ENDOFRANGE clop
@c ENDOFRANGE oclp

@node Passwd Functions
@section Reading the User Database

@c STARTOFRANGE libfudata
@cindex libraries of @command{awk} functions, user database, reading
@c STARTOFRANGE flibudata
@cindex functions, library, user database, reading
@c last comma is part of primary
@c STARTOFRANGE udatar
@cindex user database, reading
@c last comma is part of secondary
@c STARTOFRANGE dataur
@cindex database, users, reading
@cindex @code{PROCINFO} array
The @code{PROCINFO} array
(@pxref{Built-in Variables})
provides access to the current user's real and effective user and group ID
numbers, and if available, the user's supplementary group set.
However, because these are numbers, they do not provide very useful
information to the average user.  There needs to be some way to find the
user information associated with the user and group ID numbers.  This
@value{SECTION} presents a suite of functions for retrieving information from the
user database.  @xref{Group Functions},
for a similar suite that retrieves information from the group database.

@cindex @code{getpwent} function (C library)
@cindex @code{getpwent} user-defined function
@cindex users, information about, retrieving
@cindex login information
@cindex account information
@cindex password file
@cindex files, password
The POSIX standard does not define the file where user information is
kept.  Instead, it provides the @code{<pwd.h>} header file
and several C language subroutines for obtaining user information.
The primary function is @code{getpwent}, for ``get password entry.''
The ``password'' comes from the original user database file,
@file{/etc/passwd}, which stores user information, along with the
encrypted passwords (hence the name).

@cindex @command{pwcat} program
While an @command{awk} program could simply read @file{/etc/passwd}
directly, this file may not contain complete information about the
system's set of users.@footnote{It is often the case that password
information is stored in a network database.} To be sure you are able to
produce a readable and complete version of the user database, it is necessary
to write a small C program that calls @code{getpwent}.  @code{getpwent}
is defined as returning a pointer to a @code{struct passwd}.  Each time it
is called, it returns the next entry in the database.  When there are
no more entries, it returns @code{NULL}, the null pointer.  When this
happens, the C program should call @code{endpwent} to close the database.
Following is @command{pwcat}, a C program that ``cats'' the password database:

@c Use old style function header for portability to old systems (SunOS, HP/UX).

@example
@c file eg/lib/pwcat.c
/*
 * pwcat.c
 *
 * Generate a printable version of the password database
 */
@c endfile
@ignore
@c file eg/lib/pwcat.c
/*
 * Arnold Robbins, arnold@@gnu.org, May 1993
 * Public Domain
 */

#if HAVE_CONFIG_H
#include <config.h>
#endif

@c endfile
@end ignore
@c file eg/lib/pwcat.c
#include <stdio.h>
#include <pwd.h>

@c endfile
@ignore
@c file eg/lib/pwcat.c
#if defined (STDC_HEADERS)
#include <stdlib.h>
#endif

@c endfile
@end ignore
@c file eg/lib/pwcat.c
int
main(argc, argv)
int argc;
char **argv;
@{
    struct passwd *p;

    while ((p = getpwent()) != NULL)
        printf("%s:%s:%ld:%ld:%s:%s:%s\n",
            p->pw_name, p->pw_passwd, (long) p->pw_uid,
            (long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);

    endpwent();
    return 0;
@}
@c endfile
@end example

If you don't understand C, don't worry about it.
The output from @command{pwcat} is the user database, in the traditional
@file{/etc/passwd} format of colon-separated fields.  The fields are:

@ignore
@table @asis
@item Login name
The user's login name.

@item Encrypted password
The user's encrypted password.  This may not be available on some systems.

@item User-ID
The user's numeric user ID number.
(On some systems it's a C @code{long}, and not an @code{int}.  Thus
we cast it to @code{long} for all cases.)

@item Group-ID
The user's numeric group ID number.
(Similar comments about @code{long} vs.@: @code{int} apply here.)

@item Full name
The user's full name, and perhaps other information associated with the
user.

@item Home directory
The user's login (or ``home'') directory (familiar to shell programmers as
@code{$HOME}).

@item Login shell
The program that is run when the user logs in.  This is usually a
shell, such as @command{bash}.
@end table
@end ignore

@multitable {Encrypted password} {1234567890123456789012345678901234567890123456}
@item Login name @tab The user's login name.

@item Encrypted password @tab The user's encrypted password.  This may not be available on some systems.

@item User-ID @tab The user's numeric user ID number.

@item Group-ID @tab The user's numeric group ID number.

@item Full name @tab The user's full name, and perhaps other information associated with the
user.

@item Home directory @tab The user's login (or ``home'') directory (familiar to shell programmers as
@code{$HOME}).

@item Login shell @tab The program that is run when the user logs in.  This is usually a
shell, such as @command{bash}.
@end multitable

A few lines representative of @command{pwcat}'s output are as follows:

@cindex Jacobs, Andrew
@cindex Robbins, Arnold
@cindex Robbins, Miriam
@example
$ pwcat
@print{} root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
@print{} nobody:*:65534:65534::/:
@print{} daemon:*:1:1::/:
@print{} sys:*:2:2::/:/bin/csh
@print{} bin:*:3:3::/bin:
@print{} arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
@print{} miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
@print{} andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
@dots{}
@end example

With that introduction, following is a group of functions for getting user
information.  There are several functions here, corresponding to the C
functions of the same names:

@c Exercise: simplify all these functions that return values.
@c Answer: return foo[key] returns "" if key not there, no need to check with `in'.

@cindex @code{_pw_init} user-defined function
@example
@c file eg/lib/passwdawk.in
# passwd.awk --- access password file information
@c endfile
@ignore
@c file eg/lib/passwdawk.in
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993
# Revised October 2000

@c endfile
@end ignore
@c file eg/lib/passwdawk.in
BEGIN @{
    # tailor this to suit your system
    _pw_awklib = "/usr/local/libexec/awk/"
@}

function _pw_init(    oldfs, oldrs, olddol0, pwcat, using_fw)
@{
    if (_pw_inited)
        return

    oldfs = FS
    oldrs = RS
    olddol0 = $0
    using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
    FS = ":"
    RS = "\n"

    pwcat = _pw_awklib "pwcat"
    while ((pwcat | getline) > 0) @{
        _pw_byname[$1] = $0
        _pw_byuid[$3] = $0
        _pw_bycount[++_pw_total] = $0
    @}
    close(pwcat)
    _pw_count = 0
    _pw_inited = 1
    FS = oldfs
    if (using_fw)
        FIELDWIDTHS = FIELDWIDTHS
    RS = oldrs
    $0 = olddol0
@}
@c endfile
@end example

@cindex @code{BEGIN} pattern, @code{pwcat} program
The @code{BEGIN} rule sets a private variable to the directory where
@command{pwcat} is stored.  Because it is used to help out an @command{awk} library
routine, we have chosen to put it in @file{/usr/local/libexec/awk};
however, you might want it to be in a different directory on your system.

The function @code{_pw_init} keeps three copies of the user information
in three associative arrays.  The arrays are indexed by username
(@code{_pw_byname}), by user ID number (@code{_pw_byuid}), and by order of
occurrence (@code{_pw_bycount}).
The variable @code{_pw_inited} is used for efficiency; @code{_pw_init}
needs only to be called once.

@cindex @code{getline} command, @code{_pw_init} function
Because this function uses @code{getline} to read information from
@command{pwcat}, it first saves the values of @code{FS}, @code{RS}, and @code{$0}.
It notes in the variable @code{using_fw} whether field splitting
with @code{FIELDWIDTHS} is in effect or not.
Doing so is necessary, since these functions could be called
from anywhere within a user's program, and the user may have his
or her
own way of splitting records and fields.

The @code{using_fw} variable checks @code{PROCINFO["FS"]}, which
is @code{"FIELDWIDTHS"} if field splitting is being done with
@code{FIELDWIDTHS}.  This makes it possible to restore the correct
field-splitting mechanism later.  The test can only be true for
@command{gawk}.  It is false if using @code{FS} or on some other
@command{awk} implementation.

The main part of the function uses a loop to read database lines, split
the line into fields, and then store the line into each array as necessary.
When the loop is done, @code{@w{_pw_init}} cleans up by closing the pipeline,
setting @code{@w{_pw_inited}} to one, and restoring @code{FS} (and @code{FIELDWIDTHS}
if necessary), @code{RS}, and @code{$0}.
The use of @code{@w{_pw_count}} is explained shortly.

@c NEXT ED: All of these functions don't need the ... in ... test.  Just
@c return the array element, which will be "" if not already there.  Duh.
@cindex @code{getpwnam} function (C library)
The @code{getpwnam} function takes a username as a string argument.  If that
user is in the database, it returns the appropriate line.
Otherwise, it
returns the null string:

@cindex @code{getpwnam} user-defined function
@example
@group
@c file eg/lib/passwdawk.in
function getpwnam(name)
@{
    _pw_init()
    if (name in _pw_byname)
        return _pw_byname[name]
    return ""
@}
@c endfile
@end group
@end example

@cindex @code{getpwuid} function (C library)
Similarly,
the @code{getpwuid} function takes a user ID number argument.  If that
user number is in the database, it returns the appropriate line.  Otherwise, it
returns the null string:

@cindex @code{getpwuid} user-defined function
@example
@c file eg/lib/passwdawk.in
function getpwuid(uid)
@{
    _pw_init()
    if (uid in _pw_byuid)
        return _pw_byuid[uid]
    return ""
@}
@c endfile
@end example

@cindex @code{getpwent} function (C library)
The @code{getpwent} function simply steps through the database, one entry at
a time.  It uses @code{_pw_count} to track its current position in the
@code{_pw_bycount} array:

@cindex @code{getpwent} user-defined function
@example
@c file eg/lib/passwdawk.in
function getpwent()
@{
    _pw_init()
    if (_pw_count < _pw_total)
        return _pw_bycount[++_pw_count]
    return ""
@}
@c endfile
@end example

@cindex @code{endpwent} function (C library)
The @code{@w{endpwent}} function resets @code{@w{_pw_count}} to zero, so that
subsequent calls to @code{getpwent} start over again:

@cindex @code{endpwent} user-defined function
@example
@c file eg/lib/passwdawk.in
function endpwent()
@{
    _pw_count = 0
@}
@c endfile
@end example

A conscious design decision in this suite was made that each subroutine calls
@code{@w{_pw_init}} to initialize the database arrays.
The overhead of running
a separate process to generate the user database, and the I/O to scan it,
are only incurred if the user's main program actually calls one of these
functions.  If this library file is loaded along with a user's program, but
none of the routines are ever called, then there is no extra runtime overhead.
(The alternative is to move the body of @code{@w{_pw_init}} into a
@code{BEGIN} rule, which always runs @command{pwcat}.  This simplifies the
code but runs an extra process that may never be needed.)

In turn, calling @code{_pw_init} is not too expensive, because the
@code{_pw_inited} variable keeps the program from reading the data more than
once.  If you are worried about squeezing every last cycle out of your
@command{awk} program, the check of @code{_pw_inited} could be moved out of
@code{_pw_init} and duplicated in all the other functions.  In practice,
this is not necessary, since most @command{awk} programs are I/O-bound, and it
clutters up the code.

The @command{id} program in @ref{Id Program},
uses these functions.
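
As a usage sketch, the following @code{BEGIN} rule looks up one user and
splits the returned line into its fields.  It assumes that
@file{passwd.awk} is loaded along with it and that @command{pwcat} is
installed where @code{_pw_awklib} points; the login name is taken from
the sample output shown earlier, and the array name @code{field} is
illustrative only:

@example
BEGIN @{
    entry = getpwnam("arnold")      # sample login name
    if (entry == "") @{
        print "no such user" > "/dev/stderr"
        exit 1
    @}
    # Fields follow the colon-separated layout described above
    split(entry, field, ":")
    printf("uid=%s home=%s shell=%s\n", field[3], field[6], field[7])
@}
@end example

Given the sample @command{pwcat} output above, this would print
@samp{uid=2076 home=/home/arnold shell=/bin/sh}.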
@c ENDOFRANGE libfudata
@c ENDOFRANGE flibudata
@c ENDOFRANGE udatar
@c ENDOFRANGE dataur

@node Group Functions
@section Reading the Group Database

@c STARTOFRANGE libfgdata
@cindex libraries of @command{awk} functions, group database, reading
@c STARTOFRANGE flibgdata
@cindex functions, library, group database, reading
@c STARTOFRANGE gdatar
@cindex group database, reading
@c STARTOFRANGE datagr
@cindex database, group, reading
@cindex @code{PROCINFO} array
@cindex @code{getgrent} function (C library)
@cindex @code{getgrent} user-defined function
@c comma is part of primary
@cindex groups, information about
@cindex account information
@cindex group file
@cindex files, group
Much of the discussion presented in
@ref{Passwd Functions},
applies to the group database as well.  Although there has traditionally
been a well-known file (@file{/etc/group}) in a well-known format, the POSIX
standard only provides a set of C library routines
(@code{<grp.h>} and @code{getgrent})
for accessing the information.
Even though this file may exist, it likely does not have
complete information.  Therefore, as with the user database, it is necessary
to have a small C program that generates the group database as its output.

@cindex @command{grcat} program
@command{grcat}, a C program that ``cats'' the group database,
is as follows:

@example
@c file eg/lib/grcat.c
/*
 * grcat.c
 *
 * Generate a printable version of the group database
 */
@c endfile
@ignore
@c file eg/lib/grcat.c
/*
 * Arnold Robbins, arnold@@gnu.org, May 1993
 * Public Domain
 */

/* For OS/2, do nothing.
*/
#if HAVE_CONFIG_H
#include <config.h>
#endif

#if defined (STDC_HEADERS)
#include <stdlib.h>
#endif

#ifndef HAVE_GETGRENT
int main() { return 0; }
#else
@c endfile
@end ignore
@c file eg/lib/grcat.c
#include <stdio.h>
#include <grp.h>

int
main(argc, argv)
int argc;
char **argv;
@{
    struct group *g;
    int i;

    while ((g = getgrent()) != NULL) @{
        printf("%s:%s:%ld:", g->gr_name, g->gr_passwd,
                                     (long) g->gr_gid);
        for (i = 0; g->gr_mem[i] != NULL; i++) @{
            printf("%s", g->gr_mem[i]);
@group
            if (g->gr_mem[i+1] != NULL)
                putchar(',');
        @}
@end group
        putchar('\n');
    @}
    endgrent();
    return 0;
@}
@c endfile
@end example
@ignore
@c file eg/lib/grcat.c
#endif /* HAVE_GETGRENT */
@c endfile
@end ignore

Each line in the group database represents one group.  The fields are
separated with colons and represent the following information:

@ignore
@table @asis
@item Group Name
The name of the group.

@item Group Password
The encrypted group password.  In practice, this field is never used.  It is
usually empty or set to @samp{*}.

@item Group ID Number
The numeric group ID number.  This number is unique within the file.
(On some systems it's a C @code{long}, and not an @code{int}.  Thus
we cast it to @code{long} for all cases.)

@item Group Member List
A comma-separated list of usernames.  These users are members of the group.
Modern Unix systems allow users to be members of several groups
simultaneously.  If your system does, then there are elements
@code{"group1"} through @code{"group@var{N}"} in @code{PROCINFO}
for those group ID numbers.
18934(Note that @code{PROCINFO} is a @command{gawk} extension; 18935@pxref{Built-in Variables}.) 18936@end table 18937@end ignore 18938 18939@multitable {Encrypted password} {1234567890123456789012345678901234567890123456} 18940@item Group name @tab The group's name. 18941 18942@item Group password @tab The group's encrypted password. In practice, this field is never used; 18943it is usually empty or set to @samp{*}. 18944 18945@item Group-ID @tab 18946The group's numeric group ID number; this number should be unique within the file. 18947 18948@item Group member list @tab 18949A comma-separated list of usernames. These users are members of the group. 18950Modern Unix systems allow users to be members of several groups 18951simultaneously. If your system does, then there are elements 18952@code{"group1"} through @code{"group@var{N}"} in @code{PROCINFO} 18953for those group ID numbers. 18954(Note that @code{PROCINFO} is a @command{gawk} extension; 18955@pxref{Built-in Variables}.) 18956@end multitable 18957 18958Here is what running @command{grcat} might produce: 18959 18960@example 18961$ grcat 18962@print{} wheel:*:0:arnold 18963@print{} nogroup:*:65534: 18964@print{} daemon:*:1: 18965@print{} kmem:*:2: 18966@print{} staff:*:10:arnold,miriam,andy 18967@print{} other:*:20: 18968@dots{} 18969@end example 18970 18971Here are the functions for obtaining information from the group database. 
18972There are several, modeled after the C library functions of the same names: 18973 18974@cindex @code{getline} command, @code{_gr_init} user-defined function 18975@cindex @code{_gr_init} user-defined function 18976@example 18977@c file eg/lib/groupawk.in 18978# group.awk --- functions for dealing with the group file 18979@c endfile 18980@ignore 18981@c file eg/lib/groupawk.in 18982# 18983# Arnold Robbins, arnold@@gnu.org, Public Domain 18984# May 1993 18985# Revised October 2000 18986 18987@c endfile 18988@end ignore 18989@c line break on _gr_init for smallbook 18990@c file eg/lib/groupawk.in 18991BEGIN \ 18992@{ 18993 # Change to suit your system 18994 _gr_awklib = "/usr/local/libexec/awk/" 18995@} 18996 18997function _gr_init( oldfs, oldrs, olddol0, grcat, 18998 using_fw, n, a, i) 18999@{ 19000 if (_gr_inited) 19001 return 19002 19003 oldfs = FS 19004 oldrs = RS 19005 olddol0 = $0 19006 using_fw = (PROCINFO["FS"] == "FIELDWIDTHS") 19007 FS = ":" 19008 RS = "\n" 19009 19010 grcat = _gr_awklib "grcat" 19011 while ((grcat | getline) > 0) @{ 19012 if ($1 in _gr_byname) 19013 _gr_byname[$1] = _gr_byname[$1] "," $4 19014 else 19015 _gr_byname[$1] = $0 19016 if ($3 in _gr_bygid) 19017 _gr_bygid[$3] = _gr_bygid[$3] "," $4 19018 else 19019 _gr_bygid[$3] = $0 19020 19021 n = split($4, a, "[ \t]*,[ \t]*") 19022 for (i = 1; i <= n; i++) 19023 if (a[i] in _gr_groupsbyuser) 19024 _gr_groupsbyuser[a[i]] = \ 19025 _gr_groupsbyuser[a[i]] " " $1 19026 else 19027 _gr_groupsbyuser[a[i]] = $1 19028 19029 _gr_bycount[++_gr_count] = $0 19030 @} 19031 close(grcat) 19032 _gr_count = 0 19033 _gr_inited++ 19034 FS = oldfs 19035 if (using_fw) 19036 FIELDWIDTHS = FIELDWIDTHS 19037 RS = oldrs 19038 $0 = olddol0 19039@} 19040@c endfile 19041@end example 19042 19043The @code{BEGIN} rule sets a private variable to the directory where 19044@command{grcat} is stored. Because it is used to help out an @command{awk} library 19045routine, we have chosen to put it in @file{/usr/local/libexec/awk}. 
You might
want it to be in a different directory on your system.

These routines follow the same general outline as the user database routines
(@pxref{Passwd Functions}).
The @code{@w{_gr_inited}} variable is used to
ensure that the database is scanned no more than once.
The @code{@w{_gr_init}} function first saves @code{FS}, @code{RS}, and
@code{$0}, notes whether @code{FIELDWIDTHS} is in use,
and then sets @code{FS} and @code{RS} to the correct values for
scanning the group information.

The group information is stored in several associative arrays.
The arrays are indexed by group name (@code{@w{_gr_byname}}), by group ID number
(@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}).
There is an additional array indexed by username (@code{@w{_gr_groupsbyuser}}),
each element of which is a space-separated list of the groups to which
that user belongs.

Unlike the user database, it is possible to have multiple records in the
database for the same group. This is common when a group has a large number
of members. A pair of such entries might look like the following:

@example
tvpeople:*:101:johnny,jay,arsenio
tvpeople:*:101:david,conan,tom,joan
@end example

For this reason, @code{_gr_init} checks whether a group name or
group ID number has already been seen. If it has, then the usernames are
simply concatenated onto the previous list of users. (There is actually a
subtle problem with the code just presented. Suppose that
the first record for a group lists no names. This code then adds the
later names with a leading comma. It also doesn't check that there is
a @code{$4}.)

Finally, @code{_gr_init} closes the pipeline to @command{grcat}, restores
@code{FS} (and @code{FIELDWIDTHS} if necessary), @code{RS}, and @code{$0},
initializes @code{_gr_count} to zero
(it is used later), and makes @code{_gr_inited} nonzero.
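
One way to avoid the leading-comma problem mentioned earlier is sketched
here. The function @code{_gr_merge} is a hypothetical helper; it is not part
of @file{group.awk}. It assumes that every stored record ends with a colon
when its member list is empty, which is what @command{grcat} prints:

@example
# _gr_merge --- hypothetical helper, not part of group.awk.
# Append the member list `members' to the stored record `rec',
# adding a comma only when both sides are nonempty.
function _gr_merge(rec, members)
@{
    if (members == "")      # nothing to add
        return rec
    if (rec ~ /:$/)         # stored record has no members yet
        return rec members
    return rec "," members
@}

BEGIN @{
    print _gr_merge("tvpeople:*:101:", "david,conan")
    print _gr_merge("tvpeople:*:101:johnny", "")
@}
@end example

@noindent
Under these assumptions, @code{_gr_init} could call
@code{_gr_merge(_gr_byname[$1], $4)} and
@code{_gr_merge(_gr_bygid[$3], $4)} in place of the plain concatenations.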

@cindex @code{getgrnam} function (C library)
The @code{getgrnam} function takes a group name as its argument and, if that
group exists, returns its record. Otherwise, @code{getgrnam} returns the null
string:

@cindex @code{getgrnam} user-defined function
@example
@c file eg/lib/groupawk.in
function getgrnam(group)
@{
    _gr_init()
    if (group in _gr_byname)
        return _gr_byname[group]
    return ""
@}
@c endfile
@end example

@cindex @code{getgrgid} function (C library)
The @code{getgrgid} function is similar; it takes a numeric group ID and
looks up the information associated with that group ID:

@cindex @code{getgrgid} user-defined function
@example
@c file eg/lib/groupawk.in
function getgrgid(gid)
@{
    _gr_init()
    if (gid in _gr_bygid)
        return _gr_bygid[gid]
    return ""
@}
@c endfile
@end example

@cindex @code{getgruser} function (C library)
The @code{getgruser} function does not have a C counterpart. It takes a
username and returns the list of groups that have the user as a member:

@cindex @code{getgruser} function, user-defined
@example
@c file eg/lib/groupawk.in
function getgruser(user)
@{
    _gr_init()
    if (user in _gr_groupsbyuser)
        return _gr_groupsbyuser[user]
    return ""
@}
@c endfile
@end example

@cindex @code{getgrent} function (C library)
The @code{getgrent} function steps through the database one entry at a time.
19137It uses @code{_gr_count} to track its position in the list: 19138 19139@cindex @code{getgrent} user-defined function 19140@example 19141@c file eg/lib/groupawk.in 19142function getgrent() 19143@{ 19144 _gr_init() 19145 if (++_gr_count in _gr_bycount) 19146 return _gr_bycount[_gr_count] 19147 return "" 19148@} 19149@c endfile 19150@end example 19151@c ENDOFRANGE clibf 19152 19153@cindex @code{endgrent} function (C library) 19154The @code{endgrent} function resets @code{_gr_count} to zero so that @code{getgrent} can 19155start over again: 19156 19157@cindex @code{endgrent} user-defined function 19158@example 19159@c file eg/lib/groupawk.in 19160function endgrent() 19161@{ 19162 _gr_count = 0 19163@} 19164@c endfile 19165@end example 19166 19167As with the user database routines, each function calls @code{_gr_init} to 19168initialize the arrays. Doing so only incurs the extra overhead of running 19169@command{grcat} if these functions are used (as opposed to moving the body of 19170@code{_gr_init} into a @code{BEGIN} rule). 19171 19172Most of the work is in scanning the database and building the various 19173associative arrays. The functions that the user calls are themselves very 19174simple, relying on @command{awk}'s associative arrays to do work. 19175 19176The @command{id} program in @ref{Id Program}, 19177uses these functions. 19178@c ENDOFRANGE libfgdata 19179@c ENDOFRANGE flibgdata 19180@c ENDOFRANGE gdatar 19181@c ENDOFRANGE libf 19182@c ENDOFRANGE flib 19183@c ENDOFRANGE fudlib 19184@c ENDOFRANGE datagr 19185 19186@node Sample Programs 19187@chapter Practical @command{awk} Programs 19188@c STARTOFRANGE awkpex 19189@cindex @command{awk} programs, examples of 19190 19191@ref{Library Functions}, 19192presents the idea that reading programs in a language contributes to 19193learning that language. This @value{CHAPTER} continues that theme, 19194presenting a potpourri of @command{awk} programs for your reading 19195enjoyment. 
19196@ifnotinfo 19197There are three sections. 19198The first describes how to run the programs presented 19199in this @value{CHAPTER}. 19200 19201The second presents @command{awk} 19202versions of several common POSIX utilities. 19203These are programs that you are hopefully already familiar with, 19204and therefore, whose problems are understood. 19205By reimplementing these programs in @command{awk}, 19206you can focus on the @command{awk}-related aspects of solving 19207the programming problem. 19208 19209The third is a grab bag of interesting programs. 19210These solve a number of different data-manipulation and management 19211problems. Many of the programs are short, which emphasizes @command{awk}'s 19212ability to do a lot in just a few lines of code. 19213@end ifnotinfo 19214 19215Many of these programs use the library functions presented in 19216@ref{Library Functions}. 19217 19218@menu 19219* Running Examples:: How to run these examples. 19220* Clones:: Clones of common utilities. 19221* Miscellaneous Programs:: Some interesting @command{awk} programs. 19222@end menu 19223 19224@node Running Examples 19225@section Running the Example Programs 19226 19227To run a given program, you would typically do something like this: 19228 19229@example 19230awk -f @var{program} -- @var{options} @var{files} 19231@end example 19232 19233@noindent 19234Here, @var{program} is the name of the @command{awk} program (such as 19235@file{cut.awk}), @var{options} are any command-line options for the 19236program that start with a @samp{-}, and @var{files} are the actual @value{DF}s. 
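
To see how such options reach the program, a minimal sketch may help.
The file name @file{showargs.awk} is hypothetical and is not one of the
example programs. It only prints the contents of @code{ARGV}:

@example
# showargs.awk --- hypothetical helper, for illustration only
BEGIN @{
    for (i = 0; i < ARGC; i++)
        printf "ARGV[%d] = %s\n", i, ARGV[i]
@}
@end example

@noindent
Run as @samp{awk -f showargs.awk -- -c1-8 myfiles}, it should show
@samp{-c1-8} and @samp{myfiles} as ordinary elements of @code{ARGV}.
This is why the programs that follow parse their options themselves and
then clear those elements.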
19237 19238If your system supports the @samp{#!} executable interpreter mechanism 19239(@pxref{Executable Scripts}), 19240you can instead run your program directly: 19241 19242@example 19243cut.awk -c1-8 myfiles > results 19244@end example 19245 19246If your @command{awk} is not @command{gawk}, you may instead need to use this: 19247 19248@example 19249cut.awk -- -c1-8 myfiles > results 19250@end example 19251 19252@node Clones 19253@section Reinventing Wheels for Fun and Profit 19254@c last comma is part of secondary 19255@c STARTOFRANGE posimawk 19256@cindex POSIX, programs, implementing in @command{awk} 19257 19258This @value{SECTION} presents a number of POSIX utilities that are implemented in 19259@command{awk}. Reinventing these programs in @command{awk} is often enjoyable, 19260because the algorithms can be very clearly expressed, and the code is usually 19261very concise and simple. This is true because @command{awk} does so much for you. 19262 19263It should be noted that these programs are not necessarily intended to 19264replace the installed versions on your system. Instead, their 19265purpose is to illustrate @command{awk} language programming for ``real world'' 19266tasks. 19267 19268The programs are presented in alphabetical order. 19269 19270@menu 19271* Cut Program:: The @command{cut} utility. 19272* Egrep Program:: The @command{egrep} utility. 19273* Id Program:: The @command{id} utility. 19274* Split Program:: The @command{split} utility. 19275* Tee Program:: The @command{tee} utility. 19276* Uniq Program:: The @command{uniq} utility. 19277* Wc Program:: The @command{wc} utility. 
19278@end menu 19279 19280@node Cut Program 19281@subsection Cutting out Fields and Columns 19282 19283@cindex @command{cut} utility 19284@c STARTOFRANGE cut 19285@cindex @command{cut} utility 19286@c STARTOFRANGE ficut 19287@cindex fields, cutting 19288@c STARTOFRANGE colcut 19289@cindex columns, cutting 19290The @command{cut} utility selects, or ``cuts,'' characters or fields 19291from its standard input and sends them to its standard output. 19292Fields are separated by tabs by default, 19293but you may supply a command-line option to change the field 19294@dfn{delimiter} (i.e., the field-separator character). @command{cut}'s 19295definition of fields is less general than @command{awk}'s. 19296 19297A common use of @command{cut} might be to pull out just the login name of 19298logged-on users from the output of @command{who}. For example, the following 19299pipeline generates a sorted, unique list of the logged-on users: 19300 19301@example 19302who | cut -c1-8 | sort | uniq 19303@end example 19304 19305The options for @command{cut} are: 19306 19307@table @code 19308@item -c @var{list} 19309Use @var{list} as the list of characters to cut out. Items within the list 19310may be separated by commas, and ranges of characters can be separated with 19311dashes. The list @samp{1-8,15,22-35} specifies characters 1 through 193128, 15, and 22 through 35. 19313 19314@item -f @var{list} 19315Use @var{list} as the list of fields to cut out. 19316 19317@item -d @var{delim} 19318Use @var{delim} as the field-separator character instead of the tab 19319character. 19320 19321@item -s 19322Suppress printing of lines that do not contain the field delimiter. 19323@end table 19324 19325The @command{awk} implementation of @command{cut} uses the @code{getopt} library 19326function (@pxref{Getopt Function}) 19327and the @code{join} library function 19328(@pxref{Join Function}). 
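
For a fixed field list, the core of the job is small. The following is only
a sketch, roughly corresponding to @samp{cut -d: -f1,3}. It assumes that
every input line contains the delimiter and has at least three fields:

@example
# Rough equivalent of: cut -d: -f1,3
BEGIN @{ FS = ":"; OFS = FS @}
@{ print $1, $3 @}
@end example

@noindent
The sketch does not parse options, does not handle lines that lack the
delimiter, and does not skip missing fields. Handling those cases accounts
for most of the length of the full program.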
19329 19330The program begins with a comment describing the options, the library 19331functions needed, and a @code{usage} function that prints out a usage 19332message and exits. @code{usage} is called if invalid arguments are 19333supplied: 19334 19335@cindex @code{cut.awk} program 19336@example 19337@c file eg/prog/cut.awk 19338# cut.awk --- implement cut in awk 19339@c endfile 19340@ignore 19341@c file eg/prog/cut.awk 19342# 19343# Arnold Robbins, arnold@@gnu.org, Public Domain 19344# May 1993 19345 19346@c endfile 19347@end ignore 19348@c file eg/prog/cut.awk 19349# Options: 19350# -f list Cut fields 19351# -d c Field delimiter character 19352# -c list Cut characters 19353# 19354# -s Suppress lines without the delimiter 19355# 19356# Requires getopt and join library functions 19357 19358@group 19359function usage( e1, e2) 19360@{ 19361 e1 = "usage: cut [-f list] [-d c] [-s] [files...]" 19362 e2 = "usage: cut [-c list] [files...]" 19363 print e1 > "/dev/stderr" 19364 print e2 > "/dev/stderr" 19365 exit 1 19366@} 19367@end group 19368@c endfile 19369@end example 19370 19371@noindent 19372The variables @code{e1} and @code{e2} are used so that the function 19373fits nicely on the 19374@ifnotinfo 19375page. 19376@end ifnotinfo 19377@ifnottex 19378screen. 19379@end ifnottex 19380 19381@cindex @code{BEGIN} pattern, running @command{awk} programs and 19382@cindex @code{FS} variable, running @command{awk} programs and 19383Next comes a @code{BEGIN} rule that parses the command-line options. 19384It sets @code{FS} to a single TAB character, because that is @command{cut}'s 19385default field separator. The output field separator is also set to be the 19386same as the input field separator. Then @code{getopt} is used to step 19387through the command-line options. Exactly one of the variables 19388@code{by_fields} or @code{by_chars} is set to true, to indicate that 19389processing should be done by fields or by characters, respectively. 
19390When cutting by characters, the output field separator is set to the null 19391string: 19392 19393@example 19394@c file eg/prog/cut.awk 19395BEGIN \ 19396@{ 19397 FS = "\t" # default 19398 OFS = FS 19399 while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{ 19400 if (c == "f") @{ 19401 by_fields = 1 19402 fieldlist = Optarg 19403 @} else if (c == "c") @{ 19404 by_chars = 1 19405 fieldlist = Optarg 19406 OFS = "" 19407 @} else if (c == "d") @{ 19408 if (length(Optarg) > 1) @{ 19409 printf("Using first character of %s" \ 19410 " for delimiter\n", Optarg) > "/dev/stderr" 19411 Optarg = substr(Optarg, 1, 1) 19412 @} 19413 FS = Optarg 19414 OFS = FS 19415 if (FS == " ") # defeat awk semantics 19416 FS = "[ ]" 19417 @} else if (c == "s") 19418 suppress++ 19419 else 19420 usage() 19421 @} 19422 19423 for (i = 1; i < Optind; i++) 19424 ARGV[i] = "" 19425@c endfile 19426@end example 19427 19428@cindex field separators, spaces as 19429Special care is taken when the field delimiter is a space. Using 19430a single space (@code{@w{" "}}) for the value of @code{FS} is 19431incorrect---@command{awk} would separate fields with runs of spaces, 19432tabs, and/or newlines, and we want them to be separated with individual 19433spaces. Also, note that after @code{getopt} is through, we have to 19434clear out all the elements of @code{ARGV} from 1 to @code{Optind}, 19435so that @command{awk} does not try to process the command-line options 19436as @value{FN}s. 19437 19438After dealing with the command-line options, the program verifies that the 19439options make sense. Only one or the other of @option{-c} and @option{-f} 19440should be used, and both require a field list. 
Then the program calls 19441either @code{set_fieldlist} or @code{set_charlist} to pull apart the 19442list of fields or characters: 19443 19444@example 19445@c file eg/prog/cut.awk 19446 if (by_fields && by_chars) 19447 usage() 19448 19449 if (by_fields == 0 && by_chars == 0) 19450 by_fields = 1 # default 19451 19452 if (fieldlist == "") @{ 19453 print "cut: needs list for -c or -f" > "/dev/stderr" 19454 exit 1 19455 @} 19456 19457 if (by_fields) 19458 set_fieldlist() 19459 else 19460 set_charlist() 19461@} 19462@c endfile 19463@end example 19464 19465@code{set_fieldlist} is used to split the field list apart at the commas 19466and into an array. Then, for each element of the array, it looks to 19467see if it is actually a range, and if so, splits it apart. The range 19468is verified to make sure the first number is smaller than the second. 19469Each number in the list is added to the @code{flist} array, which 19470simply lists the fields that will be printed. Normal field splitting 19471is used. The program lets @command{awk} handle the job of doing the 19472field splitting: 19473 19474@example 19475@c file eg/prog/cut.awk 19476function set_fieldlist( n, m, i, j, k, f, g) 19477@{ 19478 n = split(fieldlist, f, ",") 19479 j = 1 # index in flist 19480 for (i = 1; i <= n; i++) @{ 19481 if (index(f[i], "-") != 0) @{ # a range 19482 m = split(f[i], g, "-") 19483@group 19484 if (m != 2 || g[1] >= g[2]) @{ 19485 printf("bad field list: %s\n", 19486 f[i]) > "/dev/stderr" 19487 exit 1 19488 @} 19489@end group 19490 for (k = g[1]; k <= g[2]; k++) 19491 flist[j++] = k 19492 @} else 19493 flist[j++] = f[i] 19494 @} 19495 nfields = j - 1 19496@} 19497@c endfile 19498@end example 19499 19500The @code{set_charlist} function is more complicated than @code{set_fieldlist}. 19501The idea here is to use @command{gawk}'s @code{FIELDWIDTHS} variable 19502(@pxref{Constant Size}), 19503which describes constant-width input. When using a character list, that is 19504exactly what we have. 
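
As a reminder of how @code{FIELDWIDTHS} behaves, here is a minimal sketch.
It requires @command{gawk}, and the file name @file{fw-demo.awk} is
hypothetical:

@example
# fw-demo.awk --- illustrative only; requires gawk
BEGIN @{ FIELDWIDTHS = "3 2 4" @}
@{ print $1 "|" $2 "|" $3 @}
@end example

@example
$ echo abcdefghi | gawk -f fw-demo.awk
@print{} abc|de|fghi
@end example

@noindent
Each number gives the width of one field, counted in characters from the
start of the record.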
19505 19506Setting up @code{FIELDWIDTHS} is more complicated than simply listing the 19507fields that need to be printed. We have to keep track of the fields to 19508print and also the intervening characters that have to be skipped. 19509For example, suppose you wanted characters 1 through 8, 15, and 1951022 through 35. You would use @samp{-c 1-8,15,22-35}. The necessary value 19511for @code{FIELDWIDTHS} is @code{@w{"8 6 1 6 14"}}. This yields five 19512fields, and the fields to print 19513are @code{$1}, @code{$3}, and @code{$5}. 19514The intermediate fields are @dfn{filler}, 19515which is stuff in between the desired data. 19516@code{flist} lists the fields to print, and @code{t} tracks the 19517complete field list, including filler fields: 19518 19519@example 19520@c file eg/prog/cut.awk 19521function set_charlist( field, i, j, f, g, t, 19522 filler, last, len) 19523@{ 19524 field = 1 # count total fields 19525 n = split(fieldlist, f, ",") 19526 j = 1 # index in flist 19527 for (i = 1; i <= n; i++) @{ 19528 if (index(f[i], "-") != 0) @{ # range 19529 m = split(f[i], g, "-") 19530 if (m != 2 || g[1] >= g[2]) @{ 19531 printf("bad character list: %s\n", 19532 f[i]) > "/dev/stderr" 19533 exit 1 19534 @} 19535 len = g[2] - g[1] + 1 19536 if (g[1] > 1) # compute length of filler 19537 filler = g[1] - last - 1 19538 else 19539 filler = 0 19540@group 19541 if (filler) 19542 t[field++] = filler 19543@end group 19544 t[field++] = len # length of field 19545 last = g[2] 19546 flist[j++] = field - 1 19547 @} else @{ 19548 if (f[i] > 1) 19549 filler = f[i] - last - 1 19550 else 19551 filler = 0 19552 if (filler) 19553 t[field++] = filler 19554 t[field++] = 1 19555 last = f[i] 19556 flist[j++] = field - 1 19557 @} 19558 @} 19559 FIELDWIDTHS = join(t, 1, field - 1) 19560 nfields = j - 1 19561@} 19562@c endfile 19563@end example 19564 19565Next is the rule that actually processes the data. If the @option{-s} option 19566is given, then @code{suppress} is true. 
The first @code{if} statement
makes sure that the input record does have the field separator. If
@command{cut} is processing fields, @code{suppress} is true, and the field
separator character is not in the record, then the record is skipped.

If the record is valid, then @command{gawk} has split the data
into fields, either using the character in @code{FS} or using fixed-length
fields and @code{FIELDWIDTHS}. The loop goes through the list of fields
that should be printed. The corresponding field is printed if it contains data.
If the next field also has data, then the separator character is
written out between the fields:

@example
@c file eg/prog/cut.awk
@{
    # with -s, skip records that do not contain the delimiter
    if (by_fields && suppress && index($0, FS) == 0)
        next

    for (i = 1; i <= nfields; i++) @{
        if ($flist[i] != "") @{
            printf "%s", $flist[i]
            if (i < nfields && $flist[i+1] != "")
                printf "%s", OFS
        @}
    @}
    print ""
@}
@c endfile
@end example

This version of @command{cut} relies on @command{gawk}'s @code{FIELDWIDTHS}
variable to do the character-based cutting. While it is possible in
other @command{awk} implementations to use @code{substr}
(@pxref{String Functions}),
doing so is considerably more tedious.
The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem
of picking the input line apart by characters.
@c ENDOFRANGE cut
@c ENDOFRANGE ficut
@c ENDOFRANGE colcut

@c Exercise: Rewrite using split with "".
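
For comparison, here is a sketch of the @code{substr} approach for one fixed
character list. It is roughly equivalent to @samp{cut -c1-8,15} and should
work in any @command{awk}. It is not part of @file{cut.awk}:

@example
# Portable sketch, roughly equivalent to: cut -c1-8,15
@{
    print substr($0, 1, 8) substr($0, 15, 1)
@}
@end example

@noindent
A general version would have to parse the list and issue one @code{substr}
call per range, which is the bookkeeping that @code{FIELDWIDTHS} avoids.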
19608 19609@node Egrep Program 19610@subsection Searching for Regular Expressions in Files 19611 19612@c STARTOFRANGE regexps 19613@cindex regular expressions, searching for 19614@c STARTOFRANGE sfregexp 19615@cindex searching, files for regular expressions 19616@c STARTOFRANGE fsregexp 19617@cindex files, searching for regular expressions 19618@cindex @command{egrep} utility 19619The @command{egrep} utility searches files for patterns. It uses regular 19620expressions that are almost identical to those available in @command{awk} 19621(@pxref{Regexp}). 19622It is used in the following manner: 19623 19624@example 19625egrep @r{[} @var{options} @r{]} '@var{pattern}' @var{files} @dots{} 19626@end example 19627 19628The @var{pattern} is a regular expression. In typical usage, the regular 19629expression is quoted to prevent the shell from expanding any of the 19630special characters as @value{FN} wildcards. Normally, @command{egrep} 19631prints the lines that matched. If multiple @value{FN}s are provided on 19632the command line, each output line is preceded by the name of the file 19633and a colon. 19634 19635The options to @command{egrep} are as follows: 19636 19637@table @code 19638@item -c 19639Print out a count of the lines that matched the pattern, instead of the 19640lines themselves. 19641 19642@item -s 19643Be silent. No output is produced and the exit value indicates whether 19644the pattern was matched. 19645 19646@item -v 19647Invert the sense of the test. @command{egrep} prints the lines that do 19648@emph{not} match the pattern and exits successfully if the pattern is not 19649matched. 19650 19651@item -i 19652Ignore case distinctions in both the pattern and the input data. 19653 19654@item -l 19655Only print (list) the names of the files that matched, not the lines that matched. 19656 19657@item -e @var{pattern} 19658Use @var{pattern} as the regexp to match. The purpose of the @option{-e} 19659option is to allow patterns that start with a @samp{-}. 
19660@end table 19661 19662This version uses the @code{getopt} library function 19663(@pxref{Getopt Function}) 19664and the file transition library program 19665(@pxref{Filetrans Function}). 19666 19667The program begins with a descriptive comment and then a @code{BEGIN} rule 19668that processes the command-line arguments with @code{getopt}. The @option{-i} 19669(ignore case) option is particularly easy with @command{gawk}; we just use the 19670@code{IGNORECASE} built-in variable 19671(@pxref{Built-in Variables}): 19672 19673@cindex @code{egrep.awk} program 19674@example 19675@c file eg/prog/egrep.awk 19676# egrep.awk --- simulate egrep in awk 19677@c endfile 19678@ignore 19679@c file eg/prog/egrep.awk 19680# 19681# Arnold Robbins, arnold@@gnu.org, Public Domain 19682# May 1993 19683 19684@c endfile 19685@end ignore 19686@c file eg/prog/egrep.awk 19687# Options: 19688# -c count of lines 19689# -s silent - use exit value 19690# -v invert test, success if no match 19691# -i ignore case 19692# -l print filenames only 19693# -e argument is pattern 19694# 19695# Requires getopt and file transition library functions 19696 19697BEGIN @{ 19698 while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) @{ 19699 if (c == "c") 19700 count_only++ 19701 else if (c == "s") 19702 no_print++ 19703 else if (c == "v") 19704 invert++ 19705 else if (c == "i") 19706 IGNORECASE = 1 19707 else if (c == "l") 19708 filenames_only++ 19709 else if (c == "e") 19710 pattern = Optarg 19711 else 19712 usage() 19713 @} 19714@c endfile 19715@end example 19716 19717Next comes the code that handles the @command{egrep}-specific behavior. If no 19718pattern is supplied with @option{-e}, the first nonoption on the 19719command line is used. The @command{awk} command-line arguments up to @code{ARGV[Optind]} 19720are cleared, so that @command{awk} won't try to process them as files. 
If no 19721files are specified, the standard input is used, and if multiple files are 19722specified, we make sure to note this so that the @value{FN}s can precede the 19723matched lines in the output: 19724 19725@example 19726@c file eg/prog/egrep.awk 19727 if (pattern == "") 19728 pattern = ARGV[Optind++] 19729 19730 for (i = 1; i < Optind; i++) 19731 ARGV[i] = "" 19732 if (Optind >= ARGC) @{ 19733 ARGV[1] = "-" 19734 ARGC = 2 19735 @} else if (ARGC - Optind > 1) 19736 do_filenames++ 19737 19738# if (IGNORECASE) 19739# pattern = tolower(pattern) 19740@} 19741@c endfile 19742@end example 19743 19744The last two lines are commented out, since they are not needed in 19745@command{gawk}. They should be uncommented if you have to use another version 19746of @command{awk}. 19747 19748The next set of lines should be uncommented if you are not using 19749@command{gawk}. This rule translates all the characters in the input line 19750into lowercase if the @option{-i} option is specified.@footnote{It 19751also introduces a subtle bug; 19752if a match happens, we output the translated line, not the original.} 19753The rule is 19754commented out since it is not necessary with @command{gawk}: 19755 19756@c Exercise: Fix this, w/array and new line as key to original line 19757 19758@example 19759@c file eg/prog/egrep.awk 19760#@{ 19761# if (IGNORECASE) 19762# $0 = tolower($0) 19763#@} 19764@c endfile 19765@end example 19766 19767The @code{beginfile} function is called by the rule in @file{ftrans.awk} 19768when each new file is processed. In this case, it is very simple; all it 19769does is initialize a variable @code{fcount} to zero. 
@code{fcount} tracks 19770how many lines in the current file matched the pattern 19771(naming the parameter @code{junk} shows we know that @code{beginfile} 19772is called with a parameter, but that we're not interested in its value): 19773 19774@example 19775@c file eg/prog/egrep.awk 19776function beginfile(junk) 19777@{ 19778 fcount = 0 19779@} 19780@c endfile 19781@end example 19782 19783The @code{endfile} function is called after each file has been processed. 19784It affects the output only when the user wants a count of the number of lines that 19785matched. @code{no_print} is true only if the exit status is desired. 19786@code{count_only} is true if line counts are desired. @command{egrep} 19787therefore only prints line counts if printing and counting are enabled. 19788The output format must be adjusted depending upon the number of files to 19789process. Finally, @code{fcount} is added to @code{total}, so that we 19790know the total number of lines that matched the pattern: 19791 19792@example 19793@c file eg/prog/egrep.awk 19794function endfile(file) 19795@{ 19796 if (! no_print && count_only) 19797 if (do_filenames) 19798 print file ":" fcount 19799 else 19800 print fcount 19801 19802 total += fcount 19803@} 19804@c endfile 19805@end example 19806 19807The following rule does most of the work of matching lines. The variable 19808@code{matches} is true if the line matched the pattern. If the user 19809wants lines that did not match, the sense of @code{matches} is inverted 19810using the @samp{!} operator. @code{fcount} is incremented with the value of 19811@code{matches}, which is either one or zero, depending upon a 19812successful or unsuccessful match. If the line does not match, the 19813@code{next} statement just moves on to the next record. 
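
Before looking at the rule itself, the one-or-zero result of a match can be
checked in isolation. This is only a sketch, and the pattern is an arbitrary
example:

@example
BEGIN @{
    pattern = "^a+b"
    print ("aab" ~ pattern)        # prints 1: the string matches
    print (! ("aab" ~ pattern))    # prints 0: the sense is inverted
    print ("xyz" ~ pattern)        # prints 0: no match
@}
@end example

@noindent
Because the result is numeric, it can be added directly to a counter such
as @code{fcount}.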
19814 19815@cindex @code{!} (exclamation point), @code{!} operator 19816@cindex exclamation point (@code{!}), @code{!} operator 19817A number of additional tests are made, but they are only done if we 19818are not counting lines. First, if the user only wants exit status 19819(@code{no_print} is true), then it is enough to know that @emph{one} 19820line in this file matched, and we can skip on to the next file with 19821@code{nextfile}. Similarly, if we are only printing @value{FN}s, we can 19822print the @value{FN}, and then skip to the next file with @code{nextfile}. 19823Finally, each line is printed, with a leading @value{FN} and colon 19824if necessary: 19825 19826@cindex @code{!} operator 19827@example 19828@c file eg/prog/egrep.awk 19829@{ 19830 matches = ($0 ~ pattern) 19831 if (invert) 19832 matches = ! matches 19833 19834 fcount += matches # 1 or 0 19835 19836 if (! matches) 19837 next 19838 19839 if (! count_only) @{ 19840 if (no_print) 19841 nextfile 19842 19843 if (filenames_only) @{ 19844 print FILENAME 19845 nextfile 19846 @} 19847 19848 if (do_filenames) 19849 print FILENAME ":" $0 19850 else 19851 print 19852 @} 19853@} 19854@c endfile 19855@end example 19856 19857The @code{END} rule takes care of producing the correct exit status. If 19858there are no matches, the exit status is one; otherwise it is zero: 19859 19860@example 19861@c file eg/prog/egrep.awk 19862END \ 19863@{ 19864 if (total == 0) 19865 exit 1 19866 exit 0 19867@} 19868@c endfile 19869@end example 19870 19871The @code{usage} function prints a usage message in case of invalid options, 19872and then exits: 19873 19874@example 19875@c file eg/prog/egrep.awk 19876function usage( e) 19877@{ 19878 e = "Usage: egrep [-csvil] [-e pat] [files ...]" 19879 e = e "\n\tegrep [-csvil] pat [files ...]" 19880 print e > "/dev/stderr" 19881 exit 1 19882@} 19883@c endfile 19884@end example 19885 19886The variable @code{e} is used so that the function fits nicely 19887on the printed page. 

@cindex @code{END} pattern, backslash continuation and
@cindex @code{\} (backslash), continuing lines and
@cindex backslash (@code{\}), continuing lines and
Just a note on programming style: you may have noticed that the @code{END}
rule uses backslash continuation, with the open brace on a line by
itself.  This is so that it more closely resembles the way functions
are written.  Many of the examples
in this @value{CHAPTER}
use this style. You can decide for yourself if you like writing
your @code{BEGIN} and @code{END} rules this way
or not.
@c ENDOFRANGE regexps
@c ENDOFRANGE sfregexp
@c ENDOFRANGE fsregexp

@node Id Program
@subsection Printing out User Information

@cindex printing, user information
@cindex users, information about, printing
@cindex @command{id} utility
The @command{id} utility lists a user's real and effective user ID numbers,
real and effective group ID numbers, and the user's group set, if any.
@command{id} only prints the effective user ID and group ID if they are
different from the real ones.  If possible, @command{id} also supplies the
corresponding user and group names.  The output might look like this:

@example
$ id
@print{} uid=2076(arnold) gid=10(staff) groups=10(staff),4(tty)
@end example

This information is part of what is provided by @command{gawk}'s
@code{PROCINFO} array (@pxref{Built-in Variables}).
However, the @command{id} utility provides a more palatable output than just
individual numbers.

Here is a simple version of @command{id} written in @command{awk}.
It uses the user database library functions
(@pxref{Passwd Functions})
and the group database library functions
(@pxref{Group Functions}).

The program is fairly straightforward.  All the work is done in the
@code{BEGIN} rule.
The user and group ID numbers are obtained from
@code{PROCINFO}.
The code is repetitive.  The entry in the user database for the real user ID
number is split into parts at the @samp{:}. The name is the first field.
Similar code is used for the effective user ID number and the group
numbers:

@cindex @code{id.awk} program
@example
@c file eg/prog/id.awk
# id.awk --- implement id in awk
#
# Requires user and group library functions
@c endfile
@ignore
@c file eg/prog/id.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993
# Revised February 1996

@c endfile
@end ignore
@c file eg/prog/id.awk
# output is:
# uid=12(foo) euid=34(bar) gid=3(baz) \
#             egid=5(blat) groups=9(nine),2(two),1(one)

@group
BEGIN    \
@{
    uid = PROCINFO["uid"]
    euid = PROCINFO["euid"]
    gid = PROCINFO["gid"]
    egid = PROCINFO["egid"]
@end group

    printf("uid=%d", uid)
    pw = getpwuid(uid)
    if (pw != "") @{
        split(pw, a, ":")
        printf("(%s)", a[1])
    @}

    if (euid != uid) @{
        printf(" euid=%d", euid)
        pw = getpwuid(euid)
        if (pw != "") @{
            split(pw, a, ":")
            printf("(%s)", a[1])
        @}
    @}

    printf(" gid=%d", gid)
    pw = getgrgid(gid)
    if (pw != "") @{
        split(pw, a, ":")
        printf("(%s)", a[1])
    @}

    if (egid != gid) @{
        printf(" egid=%d", egid)
        pw = getgrgid(egid)
        if (pw != "") @{
            split(pw, a, ":")
            printf("(%s)", a[1])
        @}
    @}

    for (i = 1; ("group" i) in PROCINFO; i++) @{
        if (i == 1)
            printf(" groups=")
        group = PROCINFO["group" i]
        printf("%d", group)
        pw = getgrgid(group)
        if (pw != "") @{
            split(pw, a, ":")
            printf("(%s)", a[1])
        @}
        if (("group" (i+1)) in PROCINFO)
            printf(",")
    @}

    print ""
@}
@c endfile
@end example

@cindex @code{in} operator
The test in the @code{for} loop is worth noting.
Any supplementary groups in the @code{PROCINFO} array have the
indices @code{"group1"} through @code{"group@var{N}"} for some
@var{N}, i.e., the total number of supplementary groups.
However, we don't know in advance how many of these groups
there are.

This loop works by starting at one, concatenating the value with
@code{"group"}, and then using @code{in} to see if that value is
in the array.  Eventually, @code{i} is incremented past
the last group in the array and the loop exits.

The loop is also correct if there are @emph{no} supplementary
groups; then the condition is false the first time it's
tested, and the loop body never executes.

@c exercise!!!
@ignore
The POSIX version of @command{id} takes arguments that control which
information is printed.  Modify this version to accept the same
arguments and perform in the same way.
@end ignore

@node Split Program
@subsection Splitting a Large File into Pieces

@c STARTOFRANGE filspl
@cindex files, splitting
@cindex @code{split} utility
The @code{split} program splits large text files into smaller pieces.
Usage is as follows:

@example
split @r{[}-@var{count}@r{]} file @r{[} @var{prefix} @r{]}
@end example

By default,
the output files are named @file{xaa}, @file{xab}, and so on. Each file has
1000 lines in it, with the likely exception of the last file. To change the
number of lines in each file, supply a number on the command line
preceded with a minus; e.g., @samp{-500} for files with 500 lines in them
instead of 1000.
To change the name of the output files to something like
@file{myfileaa}, @file{myfileab}, and so on, supply an additional
argument that specifies the @value{FN} prefix.

Here is a version of @code{split} in @command{awk}. It uses the @code{ord} and
@code{chr} functions presented in
@ref{Ordinal Functions}.

The program first sets its defaults, and then tests to make sure there are
not too many arguments.  It then looks at each argument in turn.  The
first argument could be a minus sign followed by a number. If it is, this happens
to look like a negative number, so it is made positive, and that is the
count of lines.  The data @value{FN} is skipped over and the final argument
is used as the prefix for the output @value{FN}s:

@cindex @code{split.awk} program
@example
@c file eg/prog/split.awk
# split.awk --- do split in awk
#
# Requires ord and chr library functions
@c endfile
@ignore
@c file eg/prog/split.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

@c endfile
@end ignore
@c file eg/prog/split.awk
# usage: split [-num] [file] [outname]

BEGIN @{
    outfile = "x"    # default
    count = 1000
    if (ARGC > 4)
        usage()

    i = 1
    if (ARGV[i] ~ /^-[0-9]+$/) @{
        count = -ARGV[i]
        ARGV[i] = ""
        i++
    @}
    # test argv in case reading from stdin instead of file
    if (i in ARGV)
        i++    # skip data file name
    if (i in ARGV) @{
        outfile = ARGV[i]
        ARGV[i] = ""
    @}

    s1 = s2 = "a"
    out = (outfile s1 s2)
@}
@c endfile
@end example

The next rule does most of the work. @code{tcount} (temporary count) tracks
how many lines have been printed to the output file so far. If it is greater
than @code{count}, it is time to close the current file and start a new one.
@code{s1} and @code{s2} track the current suffixes for the @value{FN}.
As long as @code{s2} is not @samp{z}, it simply moves to the next letter
in the alphabet.  Once @code{s2} reaches @samp{z}, there are two cases.
If @code{s1} is also @samp{z}, the file is just too big.  Otherwise, @code{s1}
moves to the next letter in the alphabet and @code{s2} starts over again at
@samp{a}:

@c else on separate line here for page breaking
@example
@c file eg/prog/split.awk
@{
    if (++tcount > count) @{
        close(out)
        if (s2 == "z") @{
            if (s1 == "z") @{
                printf("split: %s is too large to split\n",
                       FILENAME) > "/dev/stderr"
                exit 1
            @}
            s1 = chr(ord(s1) + 1)
            s2 = "a"
        @}
@group
        else
            s2 = chr(ord(s2) + 1)
@end group
        out = (outfile s1 s2)
        tcount = 1
    @}
    print > out
@}
@c endfile
@end example

@c Exercise: do this with just awk builtin functions, index("abc..."), substr, etc.

@noindent
The @code{usage} function simply prints an error message and exits:

@example
@c file eg/prog/split.awk
function usage(   e)
@{
    e = "usage: split [-num] [file] [outname]"
    print e > "/dev/stderr"
    exit 1
@}
@c endfile
@end example

@noindent
The variable @code{e} is used so that the function
fits nicely on the
@ifinfo
screen.
@end ifinfo
@ifnotinfo
page.
@end ifnotinfo

This program is a bit sloppy; it relies on @command{awk} to automatically close the last file
instead of doing it in an @code{END} rule.
It also assumes that letters are contiguous in the character set,
which isn't true for EBCDIC systems.
@c BFD...
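
The dependence on a contiguous alphabet can be avoided by looking letters up
in an explicit string with @code{index} and @code{substr}.  The following is
a minimal sketch of that idea, not part of @file{split.awk}; the function name
@code{nextsuffix} and its convention of returning the null string after
@samp{zz} are assumptions made for illustration:

@example
# nextsuffix --- return the two-letter suffix following s,
# or "" if s is already "zz" (hypothetical helper)

function nextsuffix(s,    alphabet, p1, p2)
@{
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    p1 = index(alphabet, substr(s, 1, 1))
    p2 = index(alphabet, substr(s, 2, 1))
    if (p2 < 26)       # advance the second letter
        return substr(s, 1, 1) substr(alphabet, p2 + 1, 1)
    if (p1 < 26)       # second letter wraps, advance the first
        return substr(alphabet, p1 + 1, 1) "a"
    return ""          # ran out of suffixes
@}

BEGIN @{
    print nextsuffix("aa")
    print nextsuffix("az")
@}
@end example

@noindent
When run, the sketch prints @samp{ab} and then @samp{ba}.  In the same spirit,
an @code{END} rule that calls @code{close(out)} would remove the reliance on
@command{awk} closing the last file automatically.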
@c ENDOFRANGE filspl

@node Tee Program
@subsection Duplicating Output into Multiple Files

@c last comma is part of secondary
@cindex files, multiple, duplicating output into
@cindex output, duplicating into files
@cindex @code{tee} utility
The @code{tee} program is known as a ``pipe fitting.''  @code{tee} copies
its standard input to its standard output and also duplicates it to the
files named on the command line.  Its usage is as follows:

@example
tee @r{[}-a@r{]} file @dots{}
@end example

The @option{-a} option tells @code{tee} to append to the named files, instead of
truncating them and starting over.

The @code{BEGIN} rule first makes a copy of all the command-line arguments
into an array named @code{copy}.
@code{ARGV[0]} is not copied, since it is not needed.
@code{tee} cannot use @code{ARGV} directly, since @command{awk} attempts to
process each @value{FN} in @code{ARGV} as input data.

@cindex flag variables
If the first argument is @option{-a}, then the flag variable
@code{append} is set to true, and both @code{ARGV[1]} and
@code{copy[1]} are deleted. If @code{ARGC} is less than two, then no
@value{FN}s were supplied and @code{tee} prints a usage message and exits.
Finally, @command{awk} is forced to read the standard input by setting
@code{ARGV[1]} to @code{"-"} and @code{ARGC} to two:

@c NEXT ED: Add more leading commentary in this program
@cindex @code{tee.awk} program
@example
@c file eg/prog/tee.awk
# tee.awk --- tee in awk
@c endfile
@ignore
@c file eg/prog/tee.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993
# Revised December 1995

@c endfile
@end ignore
@c file eg/prog/tee.awk
BEGIN    \
@{
    for (i = 1; i < ARGC; i++)
        copy[i] = ARGV[i]

    if (ARGV[1] == "-a") @{
        append = 1
        delete ARGV[1]
        delete copy[1]
        ARGC--
    @}
    if (ARGC < 2) @{
        print "usage: tee [-a] file ..." > "/dev/stderr"
        exit 1
    @}
    ARGV[1] = "-"
    ARGC = 2
@}
@c endfile
@end example

The single rule does all the work.  Since there is no pattern, it is
executed for each line of input.  The body of the rule simply prints the
line into each file on the command line, and then to the standard output:

@example
@c file eg/prog/tee.awk
@{
    # moving the if outside the loop makes it run faster
    if (append)
        for (i in copy)
            print >> copy[i]
    else
        for (i in copy)
            print > copy[i]
    print
@}
@c endfile
@end example

@noindent
It is also possible to write the loop this way:

@example
for (i in copy)
    if (append)
        print >> copy[i]
    else
        print > copy[i]
@end example

@noindent
This is more concise but it is also less efficient.  The @samp{if} is
tested for each record and for each output file.  By duplicating the loop
body, the @samp{if} is only tested once for each input record.
If there are
@var{N} input records and @var{M} output files, the first method only
executes @var{N} @samp{if} statements, while the second executes
@var{N}@code{*}@var{M} @samp{if} statements.

Finally, the @code{END} rule cleans up by closing all the output files:

@example
@c file eg/prog/tee.awk
END    \
@{
    for (i in copy)
        close(copy[i])
@}
@c endfile
@end example

@node Uniq Program
@subsection Printing Nonduplicated Lines of Text

@c STARTOFRANGE prunt
@cindex printing, unduplicated lines of text
@c first comma is part of primary
@c STARTOFRANGE tpul
@cindex text, printing, unduplicated lines of
@cindex @command{uniq} utility
The @command{uniq} utility reads sorted lines of data on its standard
input, and by default removes duplicate lines.  In other words, it only
prints unique lines---hence the name.  @command{uniq} has a number of
options. The usage is as follows:

@example
uniq @r{[}-udc @r{[}-@var{n}@r{]]} @r{[}+@var{n}@r{]} @r{[} @var{input file} @r{[} @var{output file} @r{]]}
@end example

The options for @command{uniq} are:

@table @code
@item -d
Print only repeated lines.

@item -u
Print only nonrepeated lines.

@item -c
Count lines. This option overrides @option{-d} and @option{-u}.  Both repeated
and nonrepeated lines are counted.

@item -@var{n}
Skip @var{n} fields before comparing lines.  The definition of fields
is similar to @command{awk}'s default: nonwhitespace characters separated
by runs of spaces and/or tabs.

@item +@var{n}
Skip @var{n} characters before comparing lines.  Any fields specified with
@samp{-@var{n}} are skipped first.

@item @var{input file}
Data is read from the input file named on the command line, instead of from
the standard input.

@item @var{output file}
The generated output is sent to the named output file, instead of to the
standard output.
@end table

Normally @command{uniq} behaves as if both the @option{-d} and
@option{-u} options are provided.

@command{uniq} uses the
@code{getopt} library function
(@pxref{Getopt Function})
and the @code{join} library function
(@pxref{Join Function}).

The program begins with a @code{usage} function and then a brief outline of
the options and their meanings in a comment.
The @code{BEGIN} rule deals with the command-line arguments and options. It
uses a trick to get @code{getopt} to handle options of the form @samp{-25},
treating such an option as the option letter @samp{2} with an argument of
@samp{5}. If indeed two or more digits are supplied (@code{Optarg} looks
like a number), @code{Optarg} is
concatenated with the option digit and then the result is added to zero to make
it into a number.  If there is only one digit in the option, then
@code{Optarg} is not needed. In this case, @code{Optind} must be decremented so that
@code{getopt} processes it next time.  This code is admittedly a bit
tricky.

If no options are supplied, then the default is taken, to print both
repeated and nonrepeated lines.  The output file, if provided, is assigned
to @code{outputfile}.
Early on, @code{outputfile} is initialized to the
standard output, @file{/dev/stdout}:

@cindex @code{uniq.awk} program
@example
@c file eg/prog/uniq.awk
@group
# uniq.awk --- do uniq in awk
#
# Requires getopt and join library functions
@end group
@c endfile
@ignore
@c file eg/prog/uniq.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

@c endfile
@end ignore
@c file eg/prog/uniq.awk
function usage(    e)
@{
    e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
    print e > "/dev/stderr"
    exit 1
@}

# -c    count lines. overrides -d and -u
# -d    only repeated lines
# -u    only non-repeated lines
# -n    skip n fields
# +n    skip n characters, skip fields first

BEGIN   \
@{
    count = 1
    outputfile = "/dev/stdout"
    opts = "udc0:1:2:3:4:5:6:7:8:9:"
    while ((c = getopt(ARGC, ARGV, opts)) != -1) @{
        if (c == "u")
            non_repeated_only++
        else if (c == "d")
            repeated_only++
        else if (c == "c")
            do_count++
        else if (index("0123456789", c) != 0) @{
            # getopt requires args to options
            # this messes us up for things like -5
            if (Optarg ~ /^[0-9]+$/)
                fcount = (c Optarg) + 0
            else @{
                fcount = c + 0
                Optind--
            @}
        @} else
            usage()
    @}

    if (ARGV[Optind] ~ /^\+[0-9]+$/) @{
        charcount = substr(ARGV[Optind], 2) + 0
        Optind++
    @}

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    if (repeated_only == 0 && non_repeated_only == 0)
        repeated_only = non_repeated_only = 1

    if (ARGC - Optind == 2) @{
        outputfile = ARGV[ARGC - 1]
        ARGV[ARGC - 1] = ""
    @}
@}
@c endfile
@end example

The following function, @code{are_equal}, compares the current line,
@code{$0}, to the
previous line, @code{last}.
It handles skipping fields and characters.
If no field count and no character count are specified, @code{are_equal}
simply returns one or zero depending upon the result of a simple string
comparison of @code{last} and @code{$0}.  Otherwise, things get more
complicated.
If fields have to be skipped, each line is broken into an array using
@code{split}
(@pxref{String Functions});
the desired fields are then joined back into a line using @code{join}.
The joined lines are stored in @code{clast} and @code{cline}.
If no fields are skipped, @code{clast} and @code{cline} are set to
@code{last} and @code{$0}, respectively.
Finally, if characters are skipped, @code{substr} is used to strip off the
leading @code{charcount} characters in @code{clast} and @code{cline}.  The
two strings are then compared and @code{are_equal} returns the result:

@example
@c file eg/prog/uniq.awk
function are_equal(    n, m, clast, cline, alast, aline)
@{
    if (fcount == 0 && charcount == 0)
        return (last == $0)

    if (fcount > 0) @{
        n = split(last, alast)
        m = split($0, aline)
        clast = join(alast, fcount+1, n)
        cline = join(aline, fcount+1, m)
    @} else @{
        clast = last
        cline = $0
    @}
    if (charcount) @{
        clast = substr(clast, charcount + 1)
        cline = substr(cline, charcount + 1)
    @}

    return (clast == cline)
@}
@c endfile
@end example

The following two rules are the body of the program.  The first one is
executed only for the very first line of data.  It sets @code{last} equal to
@code{$0}, so that subsequent lines of text have something to be compared to.

The second rule does the work. The variable @code{equal} is one or zero,
depending upon the results of @code{are_equal}'s comparison.
If @command{uniq}
is counting lines (@option{-c}) and the lines are equal, then it increments
the @code{count} variable.
Otherwise, it prints the previous line together with its count and resets
@code{count}, since the two lines are not equal.

If @command{uniq} is not counting, and if the lines are equal, @code{count} is incremented.
Nothing is printed, since the point is to remove duplicates.
Otherwise, if @command{uniq} is printing repeated lines and more than
one copy of the line was seen, or if @command{uniq} is printing nonrepeated
lines and only one copy was seen, then the previous line is printed, and
@code{count} is reset.

Finally, similar logic is used in the @code{END} rule to print the final
line of input data:

@example
@c file eg/prog/uniq.awk
NR == 1 @{
    last = $0
    next
@}

@{
    equal = are_equal()

    if (do_count) @{    # overrides -d and -u
        if (equal)
            count++
        else @{
            printf("%4d %s\n", count, last) > outputfile
            last = $0
            count = 1    # reset
        @}
        next
    @}

    if (equal)
        count++
    else @{
        if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
                print last > outputfile
        last = $0
        count = 1
    @}
@}

END @{
    if (do_count)
        printf("%4d %s\n", count, last) > outputfile
    else if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
        print last > outputfile
@}
@c endfile
@end example
@c ENDOFRANGE prunt
@c ENDOFRANGE tpul

@node Wc Program
@subsection Counting Things

@c STARTOFRANGE count
@cindex counting
@c STARTOFRANGE infco
@cindex input files, counting elements in
@c STARTOFRANGE woco
@cindex words, counting
@c STARTOFRANGE chco
@cindex characters, counting
@c STARTOFRANGE lico
@cindex lines, counting
@cindex @command{wc} utility
The @command{wc} (word count) utility counts lines, words, and characters in
one or more input files. Its usage is as follows:

@example
wc @r{[}-lwc@r{]} @r{[} @var{files} @dots{} @r{]}
@end example

If no files are specified on the command line, @command{wc} reads its standard
input. If there are multiple files, it also prints total counts for all
the files.  The options and their meanings are shown in the following list:

@table @code
@item -l
Count only lines.

@item -w
Count only words.
A ``word'' is a contiguous sequence of nonwhitespace characters, separated
by spaces and/or tabs.  Luckily, this is the normal way @command{awk} separates
fields in its input data.

@item -c
Count only characters.
@end table

Implementing @command{wc} in @command{awk} is particularly elegant,
since @command{awk} does a lot of the work for us; it splits lines into
words (i.e., fields) and counts them, it counts lines (i.e., records),
and it can easily tell us how long a line is.

This program uses the @code{getopt} library function
(@pxref{Getopt Function})
and the file-transition functions
(@pxref{Filetrans Function}).

This version has one notable difference from traditional versions of
@command{wc}: it always prints the counts in the order lines, words,
and characters.  Traditional versions note the order of the @option{-l},
@option{-w}, and @option{-c} options on the command line, and print the
counts in that order.

The @code{BEGIN} rule does the argument processing.
The variable
@code{print_total} is true if more than one file is named on the
command line:

@cindex @code{wc.awk} program
@example
@c file eg/prog/wc.awk
# wc.awk --- count lines, words, characters
@c endfile
@ignore
@c file eg/prog/wc.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993
@c endfile
@end ignore
@c file eg/prog/wc.awk

# Options:
#    -l    only count lines
#    -w    only count words
#    -c    only count characters
#
# Default is to count lines, words, characters
#
# Requires getopt and file transition library functions

BEGIN @{
    # let getopt print a message about
    # invalid options. we ignore them
    while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
        if (c == "l")
            do_lines = 1
        else if (c == "w")
            do_words = 1
        else if (c == "c")
            do_chars = 1
    @}
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    # if no options, do all
    if (! do_lines && ! do_words && ! do_chars)
        do_lines = do_words = do_chars = 1

    # ARGC - i is the number of file arguments
    print_total = (ARGC - i > 1)
@}
@c endfile
@end example

The @code{beginfile} function is simple; it just resets the counts of lines,
words, and characters to zero, and saves the current @value{FN} in
@code{fname}:

@c NEXT ED: make it lines = words = chars = 0
@example
@c file eg/prog/wc.awk
function beginfile(file)
@{
    chars = lines = words = 0
    fname = FILENAME
@}
@c endfile
@end example

The @code{endfile} function adds the current file's numbers to the running
totals of lines, words, and characters.@footnote{@command{wc} can't just use the value of
@code{FNR} in @code{endfile}.
If you examine
the code in
@ref{Filetrans Function}
you will see that
@code{FNR} has already been reset by the time
@code{endfile} is called.}  It then prints out those numbers
for the file that was just read.  It relies on @code{beginfile} to reset the
numbers for the following @value{DF}:
@c ONE DAY: make the above footnote an exercise, instead of giving away the answer.

@c NEXT ED: make order for += be lines, words, chars
@example
@c file eg/prog/wc.awk
function endfile(file)
@{
    tchars += chars
    tlines += lines
    twords += words
    if (do_lines)
        printf "\t%d", lines
@group
    if (do_words)
        printf "\t%d", words
@end group
    if (do_chars)
        printf "\t%d", chars
    printf "\t%s\n", fname
@}
@c endfile
@end example

There is one rule that is executed for each line. It adds the length of
the record, plus one, to @code{chars}.  Adding one plus the record length
is needed because the newline character separating records (the value
of @code{RS}) is not part of the record itself, and thus not included
in its length.
Next, @code{lines} is incremented for each line read,
and @code{words} is incremented by the value of @code{NF}, which is the
number of ``words'' on this line:

@example
@c file eg/prog/wc.awk
# do per line
@{
    chars += length($0) + 1    # get newline
    lines++
    words += NF
@}
@c endfile
@end example

Finally, the @code{END} rule simply prints the totals for all the files:

@example
@c file eg/prog/wc.awk
END @{
    if (print_total) @{
        if (do_lines)
            printf "\t%d", tlines
        if (do_words)
            printf "\t%d", twords
        if (do_chars)
            printf "\t%d", tchars
        print "\ttotal"
    @}
@}
@c endfile
@end example
@c ENDOFRANGE count
@c ENDOFRANGE infco
@c ENDOFRANGE lico
@c ENDOFRANGE woco
@c ENDOFRANGE chco
@c ENDOFRANGE posimawk

@node Miscellaneous Programs
@section A Grab Bag of @command{awk} Programs

This @value{SECTION} is a large ``grab bag'' of miscellaneous programs.
We hope you find them both interesting and enjoyable.

@menu
* Dupword Program::             Finding duplicated words in a document.
* Alarm Program::               An alarm clock.
* Translate Program::           A program similar to the @command{tr} utility.
* Labels Program::              Printing mailing labels.
* Word Sorting::                A program to produce a word usage count.
* History Sorting::             Eliminating duplicate entries from a history
                                file.
* Extract Program::             Pulling out programs from Texinfo source
                                files.
* Simple Sed::                  A Simple Stream Editor.
* Igawk Program::               A wrapper for @command{awk} that includes
                                files.
@end menu

@node Dupword Program
@subsection Finding Duplicated Words in a Document

@c last comma is part of secondary
@cindex words, duplicate, searching for
@cindex searching, for words
@c first comma is part of primary
@cindex documents, searching
A common error when writing large amounts of prose is to accidentally
duplicate words.  Typically you will see this in text as something like ``the
the program does the following@dots{}''  When the text is online, often
the duplicated words occur at the end of one line and the beginning of
another, making them very difficult to spot.
@c as here!

This program, @file{dupword.awk}, scans through a file one line at a time
and looks for adjacent occurrences of the same word.  It also saves the last
word on a line (in the variable @code{prev}) for comparison with the first
word on the next line.

@cindex Texinfo
The first statement makes sure that the line is all lowercase,
so that, for example, ``The'' and ``the'' compare equal to each other.
The next statement replaces every character that is neither alphanumeric
nor whitespace with a space, so that punctuation does not affect the
comparison either.
The characters are replaced with spaces so that formatting controls
don't create nonsense words (e.g., the Texinfo @samp{@@code@{NF@}}
becomes @samp{codeNF} if punctuation is simply deleted).  The record is
then resplit into fields, yielding just the actual words on the line,
and ensuring that there are no empty fields.

If there are no fields left after removing all the punctuation, the
current record is skipped.
Otherwise, the program loops through each
word, comparing it to the previous one:

@cindex @code{dupword.awk} program
@example
@c file eg/prog/dupword.awk
# dupword.awk --- find duplicate words in text
@c endfile
@ignore
@c file eg/prog/dupword.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# December 1991
# Revised October 2000

@c endfile
@end ignore
@c file eg/prog/dupword.awk
@{
    $0 = tolower($0)
    gsub(/[^[:alnum:][:blank:]]/, " ")
    $0 = $0         # re-split
    if (NF == 0)
        next
    if ($1 == prev)
        printf("%s:%d: duplicate %s\n",
            FILENAME, FNR, $1)
    for (i = 2; i <= NF; i++)
        if ($i == $(i-1))
            printf("%s:%d: duplicate %s\n",
                FILENAME, FNR, $i)
    prev = $NF
@}
@c endfile
@end example

@node Alarm Program
@subsection An Alarm Clock Program
@cindex insomnia, cure for
@cindex Robbins, Arnold
@quotation
@i{Nothing cures insomnia like a ringing alarm clock.}@*
Arnold Robbins
@end quotation

@c STARTOFRANGE tialarm
@cindex time, alarm clock example program
@c STARTOFRANGE alaex
@cindex alarm clock example program
The following program is a simple ``alarm clock'' program.
You give it a time of day and an optional message.  At the specified time,
it prints the message on the standard output. In addition, you can give it
the number of times to repeat the message as well as a delay between
repetitions.

This program uses the @code{gettimeofday} function from
@ref{Gettimeofday Function}.

All the work is done in the @code{BEGIN} rule.  The first part is argument
checking and setting of defaults: the delay, the count, and the message to
print.
If the user supplied a message without the ASCII BEL
character (known as the ``alert'' character, @code{"\a"}), then it is added to
the message. (On many systems, printing the ASCII BEL generates an
audible alert. Thus when the alarm goes off, the system calls attention
to itself in case the user is not looking at the computer or terminal.)
Here is the program:

@cindex @code{alarm.awk} program
@example
@c file eg/prog/alarm.awk
# alarm.awk --- set an alarm
#
# Requires gettimeofday library function
@c endfile
@ignore
@c file eg/prog/alarm.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

@c endfile
@end ignore
@c file eg/prog/alarm.awk
# usage: alarm time [ "message" [ count [ delay ] ] ]

BEGIN    \
@{
    # Initial argument sanity checking
    usage1 = "usage: alarm time ['message' [count [delay]]]"
    usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])

    if (ARGC < 2) @{
        print usage1 > "/dev/stderr"
        print usage2 > "/dev/stderr"
        exit 1
    @} else if (ARGC == 5) @{
        delay = ARGV[4] + 0
        count = ARGV[3] + 0
        message = ARGV[2]
    @} else if (ARGC == 4) @{
        count = ARGV[3] + 0
        message = ARGV[2]
    @} else if (ARGC == 3) @{
        message = ARGV[2]
    @} else if (ARGV[1] !~ /[0-9]?[0-9]:[0-9][0-9]/) @{
        print usage1 > "/dev/stderr"
        print usage2 > "/dev/stderr"
        exit 1
    @}

    # set defaults for once we reach the desired time
    if (delay == 0)
        delay = 180    # 3 minutes
@group
    if (count == 0)
        count = 5
@end group
    if (message == "")
        message = sprintf("\aIt is now %s!\a", ARGV[1])
    else if (index(message, "\a") == 0)
        message = "\a" message "\a"
@c endfile
@end example

The next @value{SECTION} of code turns the alarm time into hours and minutes,
converts
it (if necessary) to a 24-hour clock, and then turns that
time into a count of the seconds since midnight. Next it turns the current
time into a count of seconds since midnight. The difference between the two
is how long to wait before setting off the alarm:

@example
@c file eg/prog/alarm.awk
    # split up alarm time
    split(ARGV[1], atime, ":")
    hour = atime[1] + 0    # force numeric
    minute = atime[2] + 0  # force numeric

    # get current broken down time
    gettimeofday(now)

    # if time given is 12-hour hours and it's after that
    # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
    # then add 12 to real hour
    if (hour < 12 && now["hour"] > hour)
        hour += 12

    # set target time in seconds since midnight
    target = (hour * 60 * 60) + (minute * 60)

    # get current time in seconds since midnight
    current = (now["hour"] * 60 * 60) + \
               (now["minute"] * 60) + now["second"]

    # how long to sleep for
    naptime = target - current
    if (naptime <= 0) @{
        print "time is in the past!" > "/dev/stderr"
        exit 1
    @}
@c endfile
@end example

@cindex @command{sleep} utility
Finally, the program uses the @code{system} function
(@pxref{I/O Functions})
to call the @command{sleep} utility. The @command{sleep} utility simply pauses
for the given number of seconds. If the exit status is not zero,
the program assumes that @command{sleep} was interrupted and exits. If
@command{sleep} exited with an OK status (zero), then the program prints the
message in a loop, again using @command{sleep} to delay for however many
seconds are necessary:

@example
@c file eg/prog/alarm.awk
    # zzzzzz..... go away if interrupted
    if (system(sprintf("sleep %d", naptime)) != 0)
        exit 1

    # time to notify!
    command = sprintf("sleep %d", delay)
    for (i = 1; i <= count; i++) @{
        print message
        # if sleep command interrupted, go away
        if (system(command) != 0)
            break
    @}

    exit 0
@}
@c endfile
@end example
@c ENDOFRANGE tialarm
@c ENDOFRANGE alaex

@node Translate Program
@subsection Transliterating Characters

@c STARTOFRANGE chtra
@cindex characters, transliterating
@cindex @command{tr} utility
The system @command{tr} utility transliterates characters. For example, it is
often used to map uppercase letters into lowercase for further processing:

@example
@var{generate data} | tr 'A-Z' 'a-z' | @var{process data} @dots{}
@end example

@command{tr} requires two lists of characters.@footnote{On some older
System V systems,
@ifset ORA
including Solaris,
@end ifset
@command{tr} may require that the lists be written as
range expressions enclosed in square brackets (@samp{[a-z]}) and quoted,
to prevent the shell from attempting a @value{FN} expansion. This is
not a feature.} When processing the input, the first character in the
first list is replaced with the first character in the second list,
the second character in the first list is replaced with the second
character in the second list, and so on. If there are more characters
in the ``from'' list than in the ``to'' list, the last character of the
``to'' list is used for the remaining characters in the ``from'' list.

Some time ago,
@c early or mid-1989!
a user proposed that a transliteration function should
be added to @command{gawk}.
@c Wishing to avoid gratuitous new features,
@c at least theoretically
The following program was written to
prove that character transliteration could be done with a user-level
function.
This program is not as complete as the system @command{tr} utility
but it does most of the job.

The @command{translate} program demonstrates one of the few weaknesses
of standard @command{awk}: dealing with individual characters is very
painful, requiring repeated use of the @code{substr}, @code{index},
and @code{gsub} built-in functions
(@pxref{String Functions}).@footnote{This
program was written before @command{gawk} acquired the ability to
split each character in a string into separate array elements.}
@c Exercise: How might you use this new feature to simplify the program?
There are two functions. The first, @code{stranslate}, takes three
arguments:

@table @code
@item from
A list of characters from which to translate.

@item to
A list of characters to which to translate.

@item target
The string on which to do the translation.
@end table

Associative arrays make the translation part fairly easy. @code{t_ar} holds
the ``to'' characters, indexed by the ``from'' characters. Then a simple
loop goes through @code{from}, one character at a time. For each character
in @code{from}, if the character appears in @code{target}, @code{gsub}
is used to change it to the corresponding @code{to} character.

The @code{translate} function simply calls @code{stranslate} using @code{$0}
as the target. The main program sets two global variables, @code{FROM} and
@code{TO}, from the command line, and then changes @code{ARGV} so that
@command{awk} reads from the standard input.
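
The footnote above mentions that @command{gawk} can split a string into
individual characters. As a minimal sketch of how that could simplify the
translation step, the following function walks @code{target} one character
at a time instead of calling @code{gsub}. It assumes @command{gawk}'s
behavior of splitting on the null string; the name @code{stranslate2} and
its local variables are hypothetical and are not part of
@file{translate.awk}:

@example
# stranslate2 --- hypothetical, gawk-only variant of stranslate
function stranslate2(from, to, target,    lf, lt, t_ar, chars, n, i, c, out)
@{
    lf = length(from)
    lt = length(to)
    # map each `from' character; reuse the last `to' character
    # when `to' is shorter than `from'
    for (i = 1; i <= lf; i++)
        t_ar[substr(from, i, 1)] = substr(to, (i <= lt ? i : lt), 1)
    # gawk extension: a null separator yields one element per character
    n = split(target, chars, "")
    out = ""
    for (i = 1; i <= n; i++) @{
        c = chars[i]
        out = out ((c in t_ar) ? t_ar[c] : c)
    @}
    return out
@}
@end example

Because each character of @code{target} is looked up exactly once, a call
such as @code{stranslate2("ab", "ba", "abba")} yields @samp{baab}, and
characters that are special in regexps need no extra care.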

Finally, the processing rule simply calls @code{translate} for each record:

@cindex @code{translate.awk} program
@example
@c file eg/prog/translate.awk
# translate.awk --- do tr-like stuff
@c endfile
@ignore
@c file eg/prog/translate.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# August 1989

@c endfile
@end ignore
@c file eg/prog/translate.awk
# Bugs: does not handle things like: tr A-Z a-z, it has
# to be spelled out. However, if `to' is shorter than `from',
# the last character in `to' is used for the rest of `from'.

function stranslate(from, to, target,     lf, lt, t_ar, i, c)
@{
    lf = length(from)
    lt = length(to)
    for (i = 1; i <= lt; i++)
        t_ar[substr(from, i, 1)] = substr(to, i, 1)
    if (lt < lf)
        for (; i <= lf; i++)
            t_ar[substr(from, i, 1)] = substr(to, lt, 1)
    for (i = 1; i <= lf; i++) @{
        c = substr(from, i, 1)
        if (index(target, c) > 0)
            gsub(c, t_ar[c], target)
    @}
    return target
@}

function translate(from, to)
@{
    return $0 = stranslate(from, to, $0)
@}

# main program
BEGIN @{
@group
    if (ARGC < 3) @{
        print "usage: translate from to" > "/dev/stderr"
        exit
    @}
@end group
    FROM = ARGV[1]
    TO = ARGV[2]
    ARGC = 2
    ARGV[1] = "-"
@}

@{
    translate(FROM, TO)
    print
@}
@c endfile
@end example

While it is possible to do character transliteration in a user-level
function, it is not necessarily efficient, and we (the @command{gawk}
authors) started to consider adding a built-in function. However,
shortly after writing this program, we learned that the System V Release 4
@command{awk} had added the @code{toupper} and @code{tolower} functions
(@pxref{String Functions}).
These functions handle the vast majority of the
cases where character transliteration is necessary, and so we chose to
simply add those functions to @command{gawk} as well and then leave well
enough alone.

An obvious improvement to this program would be to set up the
@code{t_ar} array only once, in a @code{BEGIN} rule. However, this
assumes that the ``from'' and ``to'' lists
will never change throughout the lifetime of the program.
@c ENDOFRANGE chtra

@node Labels Program
@subsection Printing Mailing Labels

@c STARTOFRANGE prml
@cindex printing, mailing labels
@c comma is part of primary
@c STARTOFRANGE mlprint
@cindex mailing labels, printing
Here is a ``real world''@footnote{``Real world'' is defined as
``a program actually used to get something done.''}
program. This
script reads lists of names and
addresses and generates mailing labels. Each page of labels has 20 labels
on it, 2 across and 10 down. The addresses are guaranteed to be no more
than 5 lines of data. Each address is separated from the next by a blank
line.

The basic idea is to read 20 labels worth of data. Each line of each label
is stored in the @code{line} array. The single rule takes care of filling
the @code{line} array and printing the page when 20 labels have been read.

The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that
@command{awk} splits records at blank lines
(@pxref{Records}).
It sets @code{MAXLINES} to 100, since 100 is the maximum number
of lines on the page (20 * 5 = 100).

Most of the work is done in the @code{printpage} function.
The label lines are stored sequentially in the @code{line} array. But they
have to print horizontally; @code{line[1]} next to @code{line[6]},
@code{line[2]} next to @code{line[7]}, and so on.
Two loops are used to
accomplish this. The outer loop, controlled by @code{i}, steps through
every 10 lines of data; this is each row of labels. The inner loop,
controlled by @code{j}, goes through the lines within the row.
As @code{j} goes from 0 to 4, @samp{i+j} is the @code{j}-th line in
the row, and @samp{i+j+5} is the entry next to it. The output ends up
looking something like this:

@example
line 1          line 6
line 2          line 7
line 3          line 8
line 4          line 9
line 5          line 10
@dots{}
@end example

As a final note, an extra blank line is printed at lines 21 and 61, to keep
the output lined up on the labels. This is dependent on the particular
brand of labels in use when the program was written. You will also note
that there are 2 blank lines at the top and 2 blank lines at the bottom.

The @code{END} rule arranges to flush the final page of labels; there may
not have been an even multiple of 20 labels in the data:

@cindex @code{labels.awk} program
@example
@c file eg/prog/labels.awk
# labels.awk --- print mailing labels
@c endfile
@ignore
@c file eg/prog/labels.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# June 1992
@c endfile
@end ignore
@c file eg/prog/labels.awk

# Each label is 5 lines of data that may have blank lines.
# The label sheets have 2 blank lines at the top and 2 at
# the bottom.

BEGIN    @{ RS = "" ; MAXLINES = 100 @}

function printpage(    i, j)
@{
    if (Nlines <= 0)
        return

    printf "\n\n"        # header

    for (i = 1; i <= Nlines; i += 10) @{
        if (i == 21 || i == 61)
            print ""
        for (j = 0; j < 5; j++) @{
            if (i + j > MAXLINES)
                break
            printf " %-41s %s\n", line[i+j], line[i+j+5]
        @}
        print ""
    @}

    printf "\n\n"        # footer

    for (i in line)
        line[i] = ""
@}

# main rule
@{
    if (Count >= 20) @{
        printpage()
        Count = 0
        Nlines = 0
    @}
    n = split($0, a, "\n")
    for (i = 1; i <= n; i++)
        line[++Nlines] = a[i]
    for (; i <= 5; i++)
        line[++Nlines] = ""
    Count++
@}

END    \
@{
    printpage()
@}
@c endfile
@end example
@c ENDOFRANGE prml
@c ENDOFRANGE mlprint

@node Word Sorting
@subsection Generating Word-Usage Counts

@c last comma is part of secondary
@c STARTOFRANGE worus
@cindex words, usage counts, generating
@c NEXT ED: Rewrite this whole section and example
The following @command{awk} program prints
the number of occurrences of each word in its input. It illustrates the
associative nature of @command{awk} arrays by using strings as subscripts. It
also demonstrates the @samp{for @var{index} in @var{array}} mechanism.
Finally, it shows how @command{awk} is used in conjunction with other
utility programs to do a useful task of some complexity with a minimum of
effort. Some explanations follow the program listing:

@example
# Print list of word frequencies
@{
    for (i = 1; i <= NF; i++)
        freq[$i]++
@}

END @{
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
@}
@end example

@c Exercise: Use asort() here

This program has two rules.
The
first rule, because it has an empty pattern, is executed for every input line.
It uses @command{awk}'s field-accessing mechanism
(@pxref{Fields}) to pick out the individual words from
the line, and the built-in variable @code{NF} (@pxref{Built-in Variables})
to know how many fields are available.
For each input word, it increments an element of the array @code{freq} to
reflect that the word has been seen an additional time.

The second rule, because it has the pattern @code{END}, is not executed
until the input has been exhausted. It prints out the contents of the
@code{freq} table that has been built up inside the first action.
This program has several problems that would prevent it from being
useful by itself on real text files:

@itemize @bullet
@item
Words are detected using the @command{awk} convention that fields are
separated just by whitespace. Other characters in the input (except
newlines) don't have any special meaning to @command{awk}. This means that
punctuation characters count as part of words.

@item
The @command{awk} language considers upper- and lowercase characters to be
distinct. Therefore, ``bartender'' and ``Bartender'' are not treated
as the same word. This is undesirable, since in normal text, words
are capitalized if they begin sentences, and a frequency analyzer should not
be sensitive to capitalization.

@item
The output does not come out in any useful order. You're more likely to be
interested in which words occur most frequently or in having an alphabetized
table of how frequently each word occurs.
@end itemize

@cindex @command{sort} utility
The way to solve these problems is to use some of @command{awk}'s more advanced
features. First, we use @code{tolower} to remove
case distinctions.
Next, we use @code{gsub} to remove punctuation
characters. Finally, we use the system @command{sort} utility to process the
output of the @command{awk} script. Here is the new version of
the program:

@cindex @code{wordfreq.awk} program
@example
@c file eg/prog/wordfreq.awk
# wordfreq.awk --- print list of word frequencies

@{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
@}

END @{
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
@}
@c endfile
@end example

Assuming we have saved this program in a file named @file{wordfreq.awk},
and that the data is in @file{file1}, the following pipeline:

@example
awk -f wordfreq.awk file1 | sort -k 2nr
@end example

@noindent
produces a table of the words appearing in @file{file1} in order of
decreasing frequency. The @command{awk} program suitably massages the
data and produces a word frequency table, which is not ordered.

The @command{awk} script's output is then sorted by the @command{sort}
utility and printed on the terminal. The options given to @command{sort}
specify a sort that uses the second field of each input line (skipping
one field), that the sort keys should be treated as numeric quantities
(otherwise @samp{15} would come before @samp{5}), and that the sorting
should be done in descending (reverse) order.
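
As a small illustration of the whole pipeline, here is a run on two
made-up input lines. The sample text is ours, chosen so that no two words
have the same count, and the run assumes that @file{wordfreq.awk} holds
the program exactly as listed above and that @command{gawk} is installed:

@example
$ printf 'The cat. The dog, the cat!\nTHE CAT?\n' |
> gawk -f wordfreq.awk | sort -k 2nr
@print{} the     4
@print{} cat     3
@print{} dog     1
@end example

@noindent
The differently capitalized forms of ``the'' and ``cat'' are counted
together, and the punctuation no longer sticks to the words.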

The @command{sort} could even be done from within the program, by changing
the @code{END} action to:

@example
@c file eg/prog/wordfreq.awk
END @{
    sort = "sort -k 2nr"
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)
@}
@c endfile
@end example

This way of sorting must be used on systems that do not
have true pipes at the command-line (or batch-file) level.
See the general operating system documentation for more information on how
to use the @command{sort} program.
@c ENDOFRANGE worus

@node History Sorting
@subsection Removing Duplicates from Unsorted Text

@c last comma is part of secondary
@c STARTOFRANGE lidu
@cindex lines, duplicate, removing
The @command{uniq} program
(@pxref{Uniq Program}),
removes duplicate lines from @emph{sorted} data.

Suppose, however, you need to remove duplicate lines from a @value{DF} but
that you want to preserve the order the lines are in. A good example of
this might be a shell history file. The history file keeps a copy of all
the commands you have entered, and it is not unusual to repeat a command
several times in a row. Occasionally you might want to compact the history
by removing duplicate entries. Yet it is desirable to maintain the order
of the original commands.

This simple program does the job. It uses two arrays. The @code{data}
array is indexed by the text of each line.
For each line, @code{data[$0]} is incremented.
If a particular line has not
been seen before, then @code{data[$0]} is zero.
In this case, the text of the line is stored in @code{lines[count]}.
Each element of @code{lines} is a unique command, and the indices of
@code{lines} indicate the order in which those lines are encountered.
The @code{END} rule simply prints out the lines, in order:

@cindex Rakitzis, Byron
@cindex @code{histsort.awk} program
@example
@c file eg/prog/histsort.awk
# histsort.awk --- compact a shell history file
# Thanks to Byron Rakitzis for the general idea
@c endfile
@ignore
@c file eg/prog/histsort.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

@c endfile
@end ignore
@c file eg/prog/histsort.awk
@group
@{
    if (data[$0]++ == 0)
        lines[++count] = $0
@}
@end group

END @{
    for (i = 1; i <= count; i++)
        print lines[i]
@}
@c endfile
@end example

This program also provides a foundation for generating other useful
information. For example, using the following @code{print} statement in the
@code{END} rule indicates how often a particular command is used:

@example
print data[lines[i]], lines[i]
@end example

This works because @code{data[$0]} is incremented each time a line is
seen.
@c ENDOFRANGE lidu

@node Extract Program
@subsection Extracting Programs from Texinfo Source Files

@c STARTOFRANGE texse
@cindex Texinfo, extracting programs from source files
@c last comma is part of secondary
@c STARTOFRANGE fitex
@cindex files, Texinfo, extracting programs from
@ifnotinfo
Both this chapter and the previous chapter
(@ref{Library Functions})
present a large number of @command{awk} programs.
@end ifnotinfo
@ifinfo
The nodes
@ref{Library Functions},
and @ref{Sample Programs},
are the top level nodes for a large number of @command{awk} programs.
@end ifinfo
If you want to experiment with these programs, it is tedious to have to type
them in by hand.
Here we present a program that can extract parts of a
Texinfo input file into separate files.

@cindex Texinfo
This @value{DOCUMENT} is written in Texinfo, the GNU project's document
formatting
language.
A single Texinfo source file can be used to produce both
printed and online documentation.
@ifnotinfo
Texinfo is fully documented in the book
@cite{Texinfo---The GNU Documentation Format},
available from the Free Software Foundation.
@end ifnotinfo
@ifinfo
The Texinfo language is described fully, starting with
@ref{Top}.
@end ifinfo

For our purposes, it is enough to know three things about Texinfo input
files:

@itemize @bullet
@item
The ``at'' symbol (@samp{@@}) is special in Texinfo, much as
the backslash (@samp{\}) is in C
or @command{awk}. Literal @samp{@@} symbols are represented in Texinfo source
files as @samp{@@@@}.

@item
Comments start with either @samp{@@c} or @samp{@@comment}.
The file-extraction program works by using special comments that start
at the beginning of a line.

@item
Lines containing @samp{@@group} and @samp{@@end group} commands bracket
example text that should not be split across a page boundary.
(Unfortunately, @TeX{} isn't always smart enough to do things exactly right,
and we have to give it some help.)
@end itemize

The following program, @file{extract.awk}, reads through a Texinfo source
file and does two things, based on the special comments.
Upon seeing @samp{@w{@@c system @dots{}}},
it runs a command, by extracting the command text from the
control line and passing it on to the @code{system} function
(@pxref{I/O Functions}).
Upon seeing @samp{@@c file @var{filename}}, each subsequent line is sent to
the file @var{filename}, until @samp{@@c endfile} is encountered.
The rules in @file{extract.awk} match either @samp{@@c} or
@samp{@@comment} by letting the @samp{omment} part be optional.
Lines containing @samp{@@group} and @samp{@@end group} are simply removed.
@file{extract.awk} uses the @code{join} library function
(@pxref{Join Function}).

The example programs in the online Texinfo source for @cite{@value{TITLE}}
(@file{gawk.texi}) have all been bracketed inside @samp{file} and
@samp{endfile} lines. The @command{gawk} distribution uses a copy of
@file{extract.awk} to extract the sample programs and install many
of them in a standard directory where @command{gawk} can find them.
The Texinfo file looks something like this:

@example
@dots{}
This program has a @@code@{BEGIN@} rule,
that prints a nice message:

@@example
@@c file examples/messages.awk
BEGIN @@@{ print "Don't panic!" @@@}
@@c endfile
@@end example

It also prints some final advice:

@@example
@@c file examples/messages.awk
END @@@{ print "Always avoid bored archeologists!" @@@}
@@c endfile
@@end example
@dots{}
@end example

@file{extract.awk} begins by setting @code{IGNORECASE} to one, so that
mixed upper- and lowercase letters in the directives won't matter.
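
As a quick check of what this setting buys us, the following sketch feeds
an uppercase directive to the same kind of pattern the rules use. It
assumes @command{gawk}, since @code{IGNORECASE} has no special meaning in
other versions of @command{awk}; the @value{FN} @file{foo.awk} is made up:

@example
$ echo '@@C FILE foo.awk' |
> gawk 'BEGIN @{ IGNORECASE = 1 @}
>       /^@@c(omment)?[ \t]+file/ @{ print "matched:", $3 @}'
@print{} matched: foo.awk
@end example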

The first rule handles calling @code{system}, checking that a command is
given (@code{NF} is at least three) and also checking that the command
exits with a zero exit status, signifying OK:

@cindex @code{extract.awk} program
@example
@c file eg/prog/extract.awk
# extract.awk --- extract files and run programs
#                 from texinfo files
@c endfile
@ignore
@c file eg/prog/extract.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993
# Revised September 2000

@c endfile
@end ignore
@c file eg/prog/extract.awk
BEGIN    @{ IGNORECASE = 1 @}

/^@@c(omment)?[ \t]+system/    \
@{
    if (NF < 3) @{
        e = (FILENAME ":" FNR)
        e = (e ": badly formed `system' line")
        print e > "/dev/stderr"
        next
    @}
    $1 = ""
    $2 = ""
    stat = system($0)
    if (stat != 0) @{
        e = (FILENAME ":" FNR)
        e = (e ": warning: system returned " stat)
        print e > "/dev/stderr"
    @}
@}
@c endfile
@end example

@noindent
The variable @code{e} is used so that the rule
fits nicely on the
@ifnotinfo
page.
@end ifnotinfo
@ifnottex
screen.
@end ifnottex

The second rule handles moving data into files. It verifies that a
@value{FN} is given in the directive. If the file named is not the
current file, then the current file is closed. Keeping the current file
open until a new file is encountered allows the use of the @samp{>}
redirection for printing the contents, keeping open file management
simple.

The @samp{for} loop does the work. It reads lines using @code{getline}
(@pxref{Getline}).
For an unexpected end of file, it calls the @code{@w{unexpected_eof}}
function. If the line is an ``endfile'' line, then it breaks out of
the loop.
If the line is an @samp{@@group} or @samp{@@end group} line, then it
ignores it and goes on to the next line.
Similarly, comments within examples are also ignored.

Most of the work is in the following few lines. If the line has no @samp{@@}
symbols, the program can print it directly.
Otherwise, each leading @samp{@@} must be stripped off.
To remove the @samp{@@} symbols, the line is split into separate elements of
the array @code{a}, using the @code{split} function
(@pxref{String Functions}).
The @samp{@@} symbol is used as the separator character.
Each element of @code{a} that is empty indicates two successive @samp{@@}
symbols in the original line. For each two empty elements (@samp{@@@@} in
the original file), we have to add a single @samp{@@} symbol back in.

When the processing of the array is finished, @code{join} is called with the
value of @code{SUBSEP}, to rejoin the pieces back into a single
line. That line is then printed to the output file:

@example
@c file eg/prog/extract.awk
/^@@c(omment)?[ \t]+file/    \
@{
    if (NF != 3) @{
        e = (FILENAME ":" FNR ": badly formed `file' line")
        print e > "/dev/stderr"
        next
    @}
    if ($3 != curfile) @{
        if (curfile != "")
            close(curfile)
        curfile = $3
    @}

    for (;;) @{
        if ((getline line) <= 0)
            unexpected_eof()
        if (line ~ /^@@c(omment)?[ \t]+endfile/)
            break
        else if (line ~ /^@@(end[ \t]+)?group/)
            continue
        else if (line ~ /^@@c(omment+)?[ \t]+/)
            continue
        if (index(line, "@@") == 0) @{
            print line > curfile
            continue
        @}
        n = split(line, a, "@@")
        # if a[1] == "", means leading @@,
        # don't add one back in.
        for (i = 2; i <= n; i++) @{
            if (a[i] == "") @{ # was an @@@@
                a[i] = "@@"
                if (a[i+1] == "")
                    i++
            @}
        @}
        print join(a, 1, n, SUBSEP) > curfile
    @}
@}
@c endfile
@end example

An important thing to note is the use of the @samp{>} redirection.
Output done with @samp{>} only opens the file once; it stays open and
subsequent output is appended to the file
(@pxref{Redirection}).
This makes it easy to mix program text and explanatory prose for the same
sample source file (as has been done here!) without any hassle. The file is
only closed when a new data @value{FN} is encountered or at the end of the
input file.

Finally, the function @code{@w{unexpected_eof}} prints an appropriate
error message and then exits.
The @code{END} rule handles the final cleanup, closing the open file:

@c function lb put on same line for page breaking. sigh
@example
@c file eg/prog/extract.awk
@group
function unexpected_eof() @{
    printf("%s:%d: unexpected EOF or error\n",
        FILENAME, FNR) > "/dev/stderr"
    exit 1
@}
@end group

END @{
    if (curfile)
        close(curfile)
@}
@c endfile
@end example
@c ENDOFRANGE texse
@c ENDOFRANGE fitex

@node Simple Sed
@subsection A Simple Stream Editor

@cindex @command{sed} utility
@cindex stream editors
The @command{sed} utility is a stream editor, a program that reads a
stream of data, makes changes to it, and passes it on.
It is often used to make global changes to a large file or to a stream
of data generated by a pipeline of commands.
While @command{sed} is a complicated program in its own right, its most common
use is to perform global substitutions in the middle of a pipeline:

@example
command1 < orig.data | sed 's/old/new/g' | command2 > result
@end example

Here, @samp{s/old/new/g} tells @command{sed} to look for the regexp
@samp{old} on each input line and globally replace it with the text
@samp{new}, i.e., all the occurrences on a line. This is similar to
@command{awk}'s @code{gsub} function
(@pxref{String Functions}).

The following program, @file{awksed.awk}, accepts at least two command-line
arguments: the pattern to look for and the text to replace it with. Any
additional arguments are treated as data @value{FN}s to process. If none
are provided, the standard input is used:

@cindex Brennan, Michael
@cindex @command{awksed.awk} program
@c @cindex simple stream editor
@c @cindex stream editor, simple
@example
@c file eg/prog/awksed.awk
# awksed.awk --- do s/foo/bar/g using just print
#    Thanks to Michael Brennan for the idea
@c endfile
@ignore
@c file eg/prog/awksed.awk
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# August 1995

@c endfile
@end ignore
@c file eg/prog/awksed.awk
function usage()
@{
    print "usage: awksed pat repl [files...]" > "/dev/stderr"
    exit 1
@}

BEGIN @{
    # validate arguments
    if (ARGC < 3)
        usage()

    RS = ARGV[1]
    ORS = ARGV[2]

    # don't use arguments as files
    ARGV[1] = ARGV[2] = ""
@}

@group
# look ma, no hands!
@{
    if (RT == "")
        printf "%s", $0
    else
        print
@}
@end group
@c endfile
@end example

The program relies on @command{gawk}'s ability to have @code{RS} be a regexp,
as well as on the setting of @code{RT} to the actual text that terminates the
record (@pxref{Records}).

The idea is to have @code{RS} be the pattern to look for. @command{gawk}
automatically sets @code{$0} to the text between matches of the pattern.
This is text that we want to keep, unmodified. Then, by setting @code{ORS}
to the replacement text, a simple @code{print} statement outputs the
text we want to keep, followed by the replacement text.

There is one wrinkle to this scheme, which is what to do if the last record
doesn't end with text that matches @code{RS}. Using a @code{print}
statement unconditionally prints the replacement text, which is not correct.
However, if the file did not end in text that matches @code{RS}, @code{RT}
is set to the null string. In this case, we can print @code{$0} using
@code{printf}
(@pxref{Printf}).

The @code{BEGIN} rule handles the setup, checking for the right number
of arguments and calling @code{usage} if there is a problem. Then it sets
@code{RS} and @code{ORS} from the command-line arguments and sets
@code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they are
not treated as @value{FN}s
(@pxref{ARGC and ARGV}).

The @code{usage} function prints an error message and exits.
Finally, the single rule handles the printing scheme outlined above,
using @code{print} or @code{printf} as appropriate, depending upon the
value of @code{RT}.
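
For comparison, the same global substitution can be written more
conventionally with @code{gsub}. The following is only a sketch and is not
part of @file{awksed.awk}: the shell function name @code{awksed_gsub} and the
sample input are made up for illustration. It uses no @command{gawk}
extensions, so it should run with any POSIX @command{awk}. Unlike the
@code{RS}/@code{ORS} version, however, an @samp{&} in the replacement text
stands for the matched text:

```shell
# Hypothetical helper (not from awksed.awk): s/pat/repl/g using gsub.
awksed_gsub() {
    awk 'BEGIN {
        # validate arguments, as awksed.awk does
        if (ARGC < 3) {
            print "usage: awksed_gsub pat repl [files...]" > "/dev/stderr"
            exit 1
        }
        pat = ARGV[1]
        repl = ARGV[2]
        # do not treat the pattern and replacement as data files
        ARGV[1] = ARGV[2] = ""
    }
    # replace every match on the line, then print the line
    { gsub(pat, repl); print }' "$@"
}

# Sample run on made-up input read from standard input.
result=$(printf 'old hat, old shoe\nnothing here\n' | awksed_gsub old new)
printf '%s\n' "$result"
```

With this input, @code{result} holds the two lines @samp{new hat, new shoe}
and @samp{nothing here}.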

@ignore
Exercise, compare the performance of this version with the more
straightforward:

BEGIN {
    pat = ARGV[1]
    repl = ARGV[2]
    ARGV[1] = ARGV[2] = ""
}

{ gsub(pat, repl); print }

Exercise: what are the advantages and disadvantages of this version versus sed?
  Advantage: egrep regexps
             speed (?)
  Disadvantage: no & in replacement text

Others?
@end ignore

@node Igawk Program
@subsection An Easy Way to Use Library Functions

@c STARTOFRANGE libfex
@cindex libraries of @command{awk} functions, example program for using
@c STARTOFRANGE flibex
@cindex functions, library, example program for using
Using library functions in @command{awk} can be very beneficial. It
encourages code reuse and the writing of general functions. Programs are
smaller and therefore clearer.
However, using library functions is only easy when writing @command{awk}
programs; it is painful when running them, requiring multiple @option{-f}
options. If @command{gawk} is unavailable, then so too is the @env{AWKPATH}
environment variable and the ability to put @command{awk} functions into a
library directory (@pxref{Options}).
It would be nice to be able to write programs in the following manner:

@example
# library functions
@@include getopt.awk
@@include join.awk
@dots{}

# main program
BEGIN @{
    while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
        @dots{}
    @dots{}
@}
@end example

The following program, @file{igawk.sh}, provides this service.
It simulates @command{gawk}'s searching of the @env{AWKPATH} variable
and also allows @dfn{nested} includes; i.e., a file that is included
with @samp{@@include} can contain further @samp{@@include} statements.
@command{igawk} makes an effort to only include files once, so that nested
includes don't accidentally include a library function twice.

@command{igawk} should behave just like @command{gawk} externally. This
means it should accept all of @command{gawk}'s command-line arguments,
including the ability to have multiple source files specified via
@option{-f}, and the ability to mix command-line and library source files.

The program is written using the POSIX Shell (@command{sh}) command
language.@footnote{Fully explaining the @command{sh} language is beyond
the scope of this book. We provide some minimal explanations, but see
a good shell programming book if you wish to understand things in more
depth.} It works as follows:

@enumerate
@item
Loop through the arguments, saving anything that doesn't represent
@command{awk} source code for later, when the expanded program is run.

@item
For any arguments that do represent @command{awk} text, put the arguments into
a shell variable that will be expanded. There are two cases:

@enumerate a
@item
Literal text, provided with @option{--source} or @option{--source=}. This
text is just appended directly.

@item
Source @value{FN}s, provided with @option{-f}. We use a neat trick and append
@samp{@@include @var{filename}} to the shell variable's contents. Since the file-inclusion
program works the way @command{gawk} does, this gets the text
of the file included into the program at the correct point.
@end enumerate

@item
Run an @command{awk} program (naturally) over the shell variable's contents to expand
@samp{@@include} statements. The expanded program is placed in a second
shell variable.

@item
Run the expanded program with @command{gawk} and any other original command-line
arguments that the user supplied (such as the data @value{FN}s).
@end enumerate

This program uses shell variables extensively: for storing command-line arguments,
the text of the @command{awk} program that will expand the user's program, the
user's original program, and the expanded program. Doing so removes some
potential problems that might arise were we to use temporary files instead,
at the cost of making the script somewhat more complicated.

The initial part of the program turns on shell tracing if the first
argument is @samp{debug}.

The next part loops through all the command-line arguments.
There are several cases of interest:

@table @code
@item --
This ends the arguments to @command{igawk}. Anything else should be passed on
to the user's @command{awk} program without being evaluated.

@item -W
This indicates that the next option is specific to @command{gawk}. To make
argument processing easier, the @option{-W} is prepended to the
remaining arguments and the loop continues. (This is an @command{sh}
programming trick. Don't worry about it if you are not familiar with
@command{sh}.)

@item -v@r{,} -F
These are saved and passed on to @command{gawk}.

@item -f@r{,} --file@r{,} --file=@r{,} -Wfile=
The @value{FN} is appended to the shell variable @code{program} with an
@samp{@@include} statement.
The @command{expr} utility is used to remove the leading option part of the
argument (e.g., @samp{--file=}).
(Typical @command{sh} usage would be to use the @command{echo} and @command{sed}
utilities to do this work. Unfortunately, some versions of @command{echo} evaluate
escape sequences in their arguments, possibly mangling the program text.
Using @command{expr} avoids this problem.)

@item --source@r{,} --source=@r{,} -Wsource=
The source text is appended to @code{program}.

@item --version@r{,} -Wversion
@command{igawk} prints its version number, runs @samp{gawk --version}
to get the @command{gawk} version information, and then exits.
@end table

If none of the @option{-f}, @option{--file}, @option{-Wfile}, @option{--source},
or @option{-Wsource} arguments are supplied, then the first nonoption argument
should be the @command{awk} program. If there are no command-line
arguments left, @command{igawk} prints an error message and exits.
Otherwise, the first argument is appended to @code{program}.
In any case, after the arguments have been processed,
@code{program} contains the complete text of the original @command{awk}
program.

The program is as follows:

@cindex @code{igawk.sh} program
@example
@c file eg/prog/igawk.sh
#! /bin/sh
# igawk --- like gawk but do @@include processing
@c endfile
@ignore
@c file eg/prog/igawk.sh
#
# Arnold Robbins, arnold@@gnu.org, Public Domain
# July 1993

@c endfile
@end ignore
@c file eg/prog/igawk.sh
if [ "$1" = debug ]
then
    set -x
    shift
fi

# A literal newline, so that program text is formatted correctly
n='
'

# Initialize variables to empty
program=
opts=

while [ $# -ne 0 ] # loop over arguments
do
    case $1 in
    --)     shift; break;;

    -W)     shift
            # The $@{x?'message here'@} construct prints a
            # diagnostic if $x is unset
            set -- -W"$@{@@?'missing operand'@}"
            continue;;

    -[vF])  opts="$opts $1 '$@{2?'missing operand'@}'"
            shift;;

    -[vF]*) opts="$opts '$1'" ;;

    -f)     program="$program$n@@include $@{2?'missing operand'@}"
            shift;;

    -f*)    f=`expr "$1" : '-f\(.*\)'`
            program="$program$n@@include $f";;

    -[W-]file=*)
            f=`expr "$1" : '-.file=\(.*\)'`
            program="$program$n@@include $f";;

    -[W-]file)
            program="$program$n@@include $@{2?'missing operand'@}"
            shift;;

    -[W-]source=*)
            t=`expr "$1" : '-.source=\(.*\)'`
            program="$program$n$t";;

    -[W-]source)
            program="$program$n$@{2?'missing operand'@}"
            shift;;

    -[W-]version)
            echo igawk: version 2.0 1>&2
            gawk --version
            exit 0 ;;

    -[W-]*) opts="$opts '$1'" ;;

    *)      break;;
    esac
    shift
done

if [ -z "$program" ]
then
     program=$@{1?'missing program'@}
     shift
fi

# At this point, `program' has the program.
@c endfile
@end example

The @command{awk} program to process @samp{@@include} directives
is stored in the shell variable @code{expand_prog}.
Doing this keeps
the shell script readable. The @command{awk} program
reads through the user's program, one line at a time, using @code{getline}
(@pxref{Getline}). The input
@value{FN}s and @samp{@@include} statements are managed using a stack.
As each @samp{@@include} is encountered, the current @value{FN} is
``pushed'' onto the stack and the file named in the @samp{@@include}
directive becomes the current @value{FN}. As each file is finished,
the stack is ``popped,'' and the previous input file becomes the current
input file again. The process is started by making the original file
the first one on the stack.

The @code{pathto} function does the work of finding the full path to
a file. It simulates @command{gawk}'s behavior when searching the
@env{AWKPATH} environment variable
(@pxref{AWKPATH Variable}).
If a @value{FN} has a @samp{/} in it, no path search is done. Otherwise,
the @value{FN} is concatenated with the name of each directory in
the path, and an attempt is made to open the generated @value{FN}.
The only way to test if a file can be read in @command{awk} is to go
ahead and try to read it with @code{getline}; this is what @code{pathto}
does.@footnote{On some very old versions of @command{awk}, the test
@samp{getline junk < t} can loop forever if the file exists but is empty.
Caveat emptor.} If the file can be read, it is closed and the @value{FN}
is returned:

@ignore
An alternative way to test for the file's existence would be to call
@samp{system("test -r " t)}, which uses the @command{test} utility to
see if the file exists and is readable. The disadvantage to this method
is that it requires creating an extra process and can thus be slightly
slower.
@end ignore

@example
@c file eg/prog/igawk.sh
expand_prog='

function pathto(file, i, t, junk)
@{
    if (index(file, "/") != 0)
        return file

    for (i = 1; i <= ndirs; i++) @{
        t = (pathlist[i] "/" file)
@group
        if ((getline junk < t) > 0) @{
            # found it
            close(t)
            return t
        @}
@end group
    @}
    return ""
@}
@c endfile
@end example

The main program is contained inside one @code{BEGIN} rule. The first thing it
does is set up the @code{pathlist} array that @code{pathto} uses. After
splitting the path on @samp{:}, null elements are replaced with @code{"."},
which represents the current directory:

@example
@c file eg/prog/igawk.sh
BEGIN @{
    path = ENVIRON["AWKPATH"]
    ndirs = split(path, pathlist, ":")
    for (i = 1; i <= ndirs; i++) @{
        if (pathlist[i] == "")
            pathlist[i] = "."
    @}
@c endfile
@end example

The stack is initialized with @code{ARGV[1]}, which will be @file{/dev/stdin}.
The main loop comes next. Input lines are read in succession. Lines that
do not start with @samp{@@include} are printed verbatim.
If the line does start with @samp{@@include}, the @value{FN} is in @code{$2}.
@code{pathto} is called to generate the full path. If it cannot, then we
print an error message and continue.

The next thing to check is if the file is included already. The
@code{processed} array is indexed by the full @value{FN} of each included
file and it tracks this information for us. If the file is
seen again, a warning message is printed. Otherwise, the new @value{FN} is
pushed onto the stack and processing continues.

Finally, when @code{getline} encounters the end of the input file, the file
is closed and the stack is popped.
When @code{stackptr} is less than zero,
the program is done:

@example
@c file eg/prog/igawk.sh
    stackptr = 0
    input[stackptr] = ARGV[1] # ARGV[1] is first file

    for (; stackptr >= 0; stackptr--) @{
        while ((getline < input[stackptr]) > 0) @{
            if (tolower($1) != "@@include") @{
                print
                continue
            @}
            fpath = pathto($2)
@group
            if (fpath == "") @{
                printf("igawk:%s:%d: cannot find %s\n",
                    input[stackptr], FNR, $2) > "/dev/stderr"
                continue
            @}
@end group
            if (! (fpath in processed)) @{
                processed[fpath] = input[stackptr]
                input[++stackptr] = fpath  # push onto stack
            @} else
                print $2, "included in", input[stackptr],
                    "already included in",
                    processed[fpath] > "/dev/stderr"
        @}
        close(input[stackptr])
    @}
@}'  # close quote ends `expand_prog' variable

processed_program=`gawk -- "$expand_prog" /dev/stdin <<EOF
$program
EOF
`
@c endfile
@end example

The shell construct @samp{@var{command} << @var{marker}} is called a @dfn{here document}.
Everything in the shell script up to the @var{marker} is fed to @var{command} as input.
The shell processes the contents of the here document for variable and command substitution
(and possibly other things as well, depending upon the shell).

The shell construct @samp{`@dots{}`} is called @dfn{command substitution}.
The output of the command between the two backquotes (grave accents) is substituted
into the command line. It is saved as a single string, even if the results
contain whitespace.

The expanded program is saved in the variable @code{processed_program}.
It's done in these steps:

@enumerate
@item
Run @command{gawk} with the @samp{@@include}-processing program (the
value of the @code{expand_prog} shell variable) on standard input.

@item
Standard input is the contents of the user's program, from the shell variable @code{program}.
Its contents are fed to @command{gawk} via a here document.

@item
The results of this processing are saved in the shell variable @code{processed_program} by using command substitution.
@end enumerate

The last step is to call @command{gawk} with the expanded program,
along with the original
options and command-line arguments that the user supplied.

@c this causes more problems than it solves, so leave it out.
@ignore
The special file @file{/dev/null} is passed as a @value{DF} to @command{gawk}
to handle an interesting case. Suppose that the user's program only has
a @code{BEGIN} rule and there are no @value{DF}s to read.
The program should exit without reading any @value{DF}s.
However, suppose that an included library file defines an @code{END}
rule of its own. In this case, @command{gawk} will hang, reading standard
input. In order to avoid this, @file{/dev/null} is explicitly added to the
command-line. Reading from @file{/dev/null} always returns an immediate
end of file indication.

@c Hmm. Add /dev/null if $# is 0? Still messes up ARGV. Sigh.
@end ignore

@example
@c file eg/prog/igawk.sh
eval gawk $opts -- '"$processed_program"' '"$@@"'
@c endfile
@end example

The @command{eval} command is a shell construct that reruns the shell's parsing
process. This keeps things properly quoted.

This version of @command{igawk} represents my fourth attempt at this program.
There are four key simplifications that make the program work better:

@itemize @bullet
@item
Using @samp{@@include} even for the files named with @option{-f} makes building
the initial collected @command{awk} program much simpler; all the
@samp{@@include} processing can be done once.

@item
Not trying to save the line read with @code{getline}
in the @code{pathto} function when testing for the
file's accessibility for use with the main program simplifies things
considerably.
@c what problem does this engender though - exercise
@c answer, reading from "-" or /dev/stdin

@item
Using a @code{getline} loop in the @code{BEGIN} rule does it all in one
place. It is not necessary to call out to a separate loop for processing
nested @samp{@@include} statements.

@item
Instead of saving the expanded program in a temporary file, putting it in a shell variable
avoids some potential security problems.
This has the disadvantage that the script relies upon more features
of the @command{sh} language, making it harder to follow for those who
aren't familiar with @command{sh}.
@end itemize

Also, this program illustrates that it is often worthwhile to combine
@command{sh} and @command{awk} programming. You can usually
accomplish quite a lot, without having to resort to low-level programming
in C or C++, and it is frequently easier to do certain kinds of string
and argument manipulation using the shell than it is in @command{awk}.

Finally, @command{igawk} shows that it is not always necessary to add new
features to a program; they can often be layered on top. With @command{igawk},
there is no real reason to build @samp{@@include} processing into
@command{gawk} itself.
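
The quoting mechanics that @command{igawk} depends on (a here document
inside command substitution, followed by @command{eval}) can be tried in
isolation. The following is a minimal sketch and not part of
@file{igawk.sh}: the @code{show_args} function, the use of @command{cat} in
place of the expanding @command{gawk} run, and all sample values are
hypothetical stand-ins chosen so that the fragment runs without
@command{gawk}:

```shell
# Hypothetical stand-in for gawk: record the arguments it was handed.
show_args() {
    nargs=$#
    second_arg=$2
    last_arg=$6
}

program='BEGIN { print "hello, world" }'

# Here document inside command substitution, as in igawk.sh;
# cat stands in for the include-expanding gawk run.
processed_program=`cat <<EOF
$program
EOF
`

# Saved options carry their own single quotes, as igawk builds them.
opts="-v 'greeting=hi there'"

# Pretend the user supplied two data file names containing spaces.
set -- "file one" "file two"

# eval reparses the line: $opts is expanded on the first pass, while
# the single-quoted '"$processed_program"' and '"$@"' are expanded,
# still inside double quotes, on the second pass.
eval show_args $opts -- '"$processed_program"' '"$@"'

echo "$nargs arguments; second is: $second_arg"
```

Because of the second parsing pass, the stand-in command receives six
arguments: the option value and each @value{FN} arrive as single words even
though they contain spaces, and the program text is passed through unchanged.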

@cindex search paths, for source files
@c comma is part of primary
@cindex source files, search path for
@c last comma is part of secondary
@cindex files, source, search path for
@cindex directories, searching
As an additional example of layering features on top of @command{gawk},
consider the idea of having two
files in a directory in the search path:

@table @file
@item default.awk
This file contains a set of default library functions, such
as @code{getopt} and @code{assert}.

@item site.awk
This file contains library functions that are specific to a site or
installation; i.e., locally developed functions.
Having a separate file allows @file{default.awk} to change with
new @command{gawk} releases, without requiring the system administrator to
update it each time by adding the local functions.
@end table

One user
@c Karl Berry, karl@ileaf.com, 10/95
suggested that @command{gawk} be modified to automatically read these files
upon startup. Instead, it would be very simple to modify @command{igawk}
to do this. Since @command{igawk} can process nested @samp{@@include}
directives, @file{default.awk} could simply contain @samp{@@include}
statements for the desired library functions.
@c ENDOFRANGE libfex
@c ENDOFRANGE flibex
@c ENDOFRANGE awkpex

@c Exercise: make this change

@ignore
@c Try this
@iftex
@page
@headings off
@majorheading III@ @ @ Appendixes
Part III provides the appendixes, the Glossary, and two licenses that cover
the @command{gawk} source code and this @value{DOCUMENT}, respectively.
It contains the following appendixes:

@itemize @bullet
@item
@ref{Language History}.

@item
@ref{Installation}.

@item
@ref{Notes}.

@item
@ref{Basic Concepts}.

@item
@ref{Glossary}.

@item
@ref{Copying}.

@item
@ref{GNU Free Documentation License}.
@end itemize

@page
@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
@oddheading  @| @| @strong{@thischapter}@ @ @ @thispage
@end iftex
@end ignore

@node Language History
@appendix The Evolution of the @command{awk} Language

This @value{DOCUMENT} describes the GNU implementation of @command{awk}, which follows
the POSIX specification.
Many long-time @command{awk} users learned @command{awk} programming
with the original @command{awk} implementation in Version 7 Unix.
(This implementation was the basis for @command{awk} in Berkeley Unix,
through 4.3-Reno. Subsequent versions of Berkeley Unix, and systems
derived from 4.4BSD-Lite, use various versions of @command{gawk}
for their @command{awk}.)
This @value{APPENDIX} briefly describes the
evolution of the @command{awk} language, with cross-references to other parts
of the @value{DOCUMENT} where you can find more information.

@menu
* V7/SVR3.1::                   The major changes between V7 and System V
                                Release 3.1.
* SVR4::                        Minor changes between System V Releases 3.1
                                and 4.
* POSIX::                       New features from the POSIX standard.
* BTL::                         New features from the Bell Laboratories
                                version of @command{awk}.
* POSIX/GNU::                   The extensions in @command{gawk} not in POSIX
                                @command{awk}.
* Contributors::                The major contributors to @command{gawk}.
@end menu

@node V7/SVR3.1
@appendixsec Major Changes Between V7 and SVR3.1
@c STARTOFRANGE gawkv
@cindex @command{awk}, versions of
@c STARTOFRANGE gawkv1
@cindex @command{awk}, versions of, changes between V7 and SVR3.1

The @command{awk} language evolved considerably between the release of
Version 7 Unix (1978) and the new version that was first made generally available in
System V Release 3.1 (1987). This @value{SECTION} summarizes the changes, with
cross-references to further details:

@itemize @bullet
@item
The requirement for @samp{;} to separate rules on a line
(@pxref{Statements/Lines}).

@item
User-defined functions and the @code{return} statement
(@pxref{User-defined}).

@item
The @code{delete} statement (@pxref{Delete}).

@item
The @code{do}-@code{while} statement
(@pxref{Do Statement}).

@item
The built-in functions @code{atan2}, @code{cos}, @code{sin}, @code{rand}, and
@code{srand} (@pxref{Numeric Functions}).

@item
The built-in functions @code{gsub}, @code{sub}, and @code{match}
(@pxref{String Functions}).

@item
The built-in functions @code{close} and @code{system}
(@pxref{I/O Functions}).

@item
The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART},
and @code{SUBSEP} built-in variables (@pxref{Built-in Variables}).

@item
The conditional expression using the ternary operator @samp{?:}
(@pxref{Conditional Exp}).

@item
The exponentiation operator @samp{^}
(@pxref{Arithmetic Ops}) and its assignment operator
form @samp{^=} (@pxref{Assignment Ops}).

@item
C-compatible operator precedence, which breaks some old @command{awk}
programs (@pxref{Precedence}).

@item
Regexps as the value of @code{FS}
(@pxref{Field Separators}) and as the
third argument to the @code{split} function
(@pxref{String Functions}).

@item
Dynamic regexps as operands of the @samp{~} and @samp{!~} operators
(@pxref{Regexp Usage}).

@item
The escape sequences @samp{\b}, @samp{\f}, and @samp{\r}
(@pxref{Escape Sequences}).
(Some vendors have updated their old versions of @command{awk} to
recognize @samp{\b}, @samp{\f}, and @samp{\r}, but this is not
something you can rely on.)

@item
Redirection of input for the @code{getline} function
(@pxref{Getline}).

@item
Multiple @code{BEGIN} and @code{END} rules
(@pxref{BEGIN/END}).

@item
Multidimensional arrays
(@pxref{Multi-dimensional}).
@end itemize
@c ENDOFRANGE gawkv1

@node SVR4
@appendixsec Changes Between SVR3.1 and SVR4

@cindex @command{awk}, versions of, changes between SVR3.1 and SVR4
The System V Release 4 (1989) version of Unix @command{awk} added these features
(some of which originated in @command{gawk}):

@itemize @bullet
@item
The @code{ENVIRON} variable (@pxref{Built-in Variables}).
@c gawk and MKS awk

@item
Multiple @option{-f} options on the command line
(@pxref{Options}).
@c MKS awk

@item
The @option{-v} option for assigning variables before program execution begins
(@pxref{Options}).
@c GNU, Bell Laboratories & MKS together

@item
The @option{--} option for terminating command-line options.

@item
The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences
(@pxref{Escape Sequences}).
@c GNU, for ANSI C compat

@item
A defined return value for the @code{srand} built-in function
(@pxref{Numeric Functions}).

@item
The @code{toupper} and @code{tolower} built-in string functions
for case translation
(@pxref{String Functions}).

@item
A cleaner specification for the @samp{%c} format-control letter in the
@code{printf} function
(@pxref{Control Letters}).

@item
The ability to dynamically pass the field width and precision (@code{"%*.*d"})
in the argument list of the @code{printf} function
(@pxref{Control Letters}).

@item
The use of regexp constants, such as @code{/foo/}, as expressions, where
they are equivalent to using the matching operator, as in @samp{$0 ~ /foo/}
(@pxref{Using Constant Regexps}).

@item
Processing of escape sequences inside command-line variable assignments
(@pxref{Assignment Options}).
@end itemize

@node POSIX
@appendixsec Changes Between SVR4 and POSIX @command{awk}
@cindex @command{awk}, versions of, changes between SVR4 and POSIX @command{awk}
@cindex POSIX @command{awk}, changes in @command{awk} versions

The POSIX Command Language and Utilities standard for @command{awk} (1992)
introduced the following changes into the language:

@itemize @bullet
@item
The use of @option{-W} for implementation-specific options
(@pxref{Options}).

@item
The use of @code{CONVFMT} for controlling the conversion of numbers
to strings (@pxref{Conversion}).

@item
The concept of a numeric string and tighter comparison rules to go
with it (@pxref{Typing and Comparison}).

@item
More complete documentation of many of the previously undocumented
features of the language.
@end itemize

The following common extensions are not permitted by the POSIX
standard:

@c IMPORTANT! Keep this list in sync with the one in node Options

@itemize @bullet
@item
@code{\x} escape sequences are not recognized
(@pxref{Escape Sequences}).

@item
Newlines do not act as whitespace to separate fields when @code{FS} is
equal to a single space
(@pxref{Fields}).

@item
Newlines are not allowed after @samp{?} or @samp{:}
(@pxref{Conditional Exp}).

@item
The synonym @code{func} for the keyword @code{function} is not
recognized (@pxref{Definition Syntax}).

@item
The operators @samp{**} and @samp{**=} cannot be used in
place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops},
and @ref{Assignment Ops}).

@item
Specifying @samp{-Ft} on the command line does not set the value
of @code{FS} to be a single TAB character
(@pxref{Field Separators}).

@item
The @code{fflush} built-in function is not supported
(@pxref{I/O Functions}).
@end itemize
@c ENDOFRANGE gawkv

@node BTL
@appendixsec Extensions in the Bell Laboratories @command{awk}

@cindex @command{awk}, versions of, See Also Bell Laboratories @command{awk}
@cindex extensions, Bell Laboratories @command{awk}
@cindex Bell Laboratories @command{awk} extensions
@cindex Kernighan, Brian
Brian Kernighan, one of the original designers of Unix @command{awk},
has made his version available via his home page
(@pxref{Other Versions}).
This @value{SECTION} describes extensions in his version of @command{awk} that are
not in POSIX @command{awk}:

@itemize @bullet
@item
The @samp{-mf @var{N}} and @samp{-mr @var{N}} command-line options
to set the maximum number of fields and the maximum
record size, respectively
(@pxref{Options}).
22663As a side note, his @command{awk} no longer needs these options; 22664it continues to accept them to avoid breaking old programs. 22665 22666@item 22667The @code{fflush} built-in function for flushing buffered output 22668(@pxref{I/O Functions}). 22669 22670@item 22671The @samp{**} and @samp{**=} operators 22672(@pxref{Arithmetic Ops} 22673and 22674@ref{Assignment Ops}). 22675 22676@item 22677The use of @code{func} as an abbreviation for @code{function} 22678(@pxref{Definition Syntax}). 22679 22680@ignore 22681@item 22682The @code{SYMTAB} array, that allows access to @command{awk}'s internal symbol 22683table. This feature is not documented, largely because 22684it is somewhat shakily implemented. For instance, you cannot access arrays 22685or array elements through it. 22686@end ignore 22687@end itemize 22688 22689The Bell Laboratories @command{awk} also incorporates the following extensions, 22690originally developed for @command{gawk}: 22691 22692@itemize @bullet 22693@item 22694The @samp{\x} escape sequence 22695(@pxref{Escape Sequences}). 22696 22697@item 22698The @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr} 22699special files 22700(@pxref{Special Files}). 22701 22702@item 22703The ability for @code{FS} and for the third 22704argument to @code{split} to be null strings 22705(@pxref{Single Character Fields}). 22706 22707@item 22708The @code{nextfile} statement 22709(@pxref{Nextfile Statement}). 22710 22711@item 22712The ability to delete all of an array at once with @samp{delete @var{array}} 22713(@pxref{Delete}). 22714@end itemize 22715 22716@node POSIX/GNU 22717@appendixsec Extensions in @command{gawk} Not in POSIX @command{awk} 22718 22719@ignore 22720I've tried to follow this general order, esp. 
for the 3.0 and 3.1 sections:
	variables
	special files
	language changes (e.g., hex constants)
	differences in standard awk functions
	new gawk functions
	new keywords
	new command-line options
	new ports
Within each category, be alphabetical.
@end ignore

@c STARTOFRANGE fripls
@cindex compatibility mode (@command{gawk}), extensions
@c STARTOFRANGE exgnot
@cindex extensions, in @command{gawk}, not in POSIX @command{awk}
@c STARTOFRANGE posnot
@cindex POSIX, @command{gawk} extensions not included in
The GNU implementation, @command{gawk}, adds a large number of features.
This @value{SECTION} lists them in the order they were added to @command{gawk}.
They can all be disabled with either the @option{--traditional} or
@option{--posix} options
(@pxref{Options}).

Version 2.10 of @command{gawk} introduced the following features:

@itemize @bullet
@item
The @env{AWKPATH} environment variable for specifying a path search for
the @option{-f} command-line option
(@pxref{Options}).

@item
The @code{IGNORECASE} variable and its effects
(@pxref{Case-sensitivity}).

@item
The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr} and
@file{/dev/fd/@var{N}} special @value{FN}s
(@pxref{Special Files}).
@end itemize

Version 2.13 of @command{gawk} introduced the following features:

@itemize @bullet
@item
The @code{FIELDWIDTHS} variable and its effects
(@pxref{Constant Size}).

@item
The @code{systime} and @code{strftime} built-in functions for obtaining
and printing timestamps
(@pxref{Time Functions}).

@item
The @option{-W lint} option to provide error and portability checking
for both the source code and at runtime
(@pxref{Options}).

@item
The @option{-W compat} option to turn off the GNU extensions
(@pxref{Options}).

@item
The @option{-W posix} option for full POSIX compliance
(@pxref{Options}).
@end itemize

Version 2.14 of @command{gawk} introduced the following feature:

@itemize @bullet
@item
The @code{next file} statement for skipping to the next @value{DF}
(@pxref{Nextfile Statement}).
@end itemize

Version 2.15 of @command{gawk} introduced the following features:

@itemize @bullet
@item
The @code{ARGIND} variable, which tracks the movement of @code{FILENAME}
through @code{ARGV} (@pxref{Built-in Variables}).

@item
The @code{ERRNO} variable, which contains the system error message when
@code{getline} returns @minus{}1 or @code{close} fails
(@pxref{Built-in Variables}).

@item
The @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and
@file{/dev/user} @value{FN} interpretation
(@pxref{Special Files}).

@item
The ability to delete all of an array at once with @samp{delete @var{array}}
(@pxref{Delete}).

@item
The ability to use GNU-style long-named options that start with @option{--}
(@pxref{Options}).

@item
The @option{--source} option for mixing command-line and library-file
source code
(@pxref{Options}).
@end itemize

Version 3.0 of @command{gawk} introduced the following features:

@itemize @bullet
@item
@code{IGNORECASE} changed, now applying to string comparison as well
as regexp operations
(@pxref{Case-sensitivity}).

@item
The @code{RT} variable that contains the input text that
matched @code{RS}
(@pxref{Records}).

@item
Full support for both POSIX and GNU regexps
(@pxref{Regexp}).

@item
The @code{gensub} function for more powerful text manipulation
(@pxref{String Functions}).

@item
The @code{strftime} function acquired a default time format,
allowing it to be called with no arguments
(@pxref{Time Functions}).

@item
The ability for @code{FS} and for the third
argument to @code{split} to be null strings
(@pxref{Single Character Fields}).

@item
The ability for @code{RS} to be a regexp
(@pxref{Records}).

@item
The @code{next file} statement became @code{nextfile}
(@pxref{Nextfile Statement}).

@item
The @option{--lint-old} option to
warn about constructs that are not available in
the original Version 7 Unix version of @command{awk}
(@pxref{V7/SVR3.1}).

@item
The @option{-m} option and the @code{fflush} function from the
Bell Laboratories research version of @command{awk}
(@pxref{Options}; also
@pxref{I/O Functions}).

@item
The @option{--re-interval} option to provide interval expressions in regexps
(@pxref{Regexp Operators}).

@item
The @option{--traditional} option was added as a better name for
@option{--compat} (@pxref{Options}).

@item
The use of GNU Autoconf to control the configuration process
(@pxref{Quick Installation}).

@item
Amiga support
(@pxref{Amiga Installation}).

@end itemize

Version 3.1 of @command{gawk} introduced the following features:

@itemize @bullet
@item
The @code{BINMODE} special variable for non-POSIX systems,
which allows binary I/O for input and/or output files
(@pxref{PC Using}).

@item
The @code{LINT} special variable, which dynamically controls lint warnings
(@pxref{Built-in Variables}).

@item
The @code{PROCINFO} array for providing process-related information
(@pxref{Built-in Variables}).

@item
The @code{TEXTDOMAIN} special variable for setting an application's
internationalization text domain
(@pxref{Built-in Variables},
and
@ref{Internationalization}).

@item
The ability to use octal and hexadecimal constants in @command{awk}
program source code
(@pxref{Nondecimal-numbers}).

@item
The @samp{|&} operator for two-way I/O to a coprocess
(@pxref{Two-way I/O}).

@item
The @file{/inet} special files for TCP/IP networking using @samp{|&}
(@pxref{TCP/IP Networking}).

@item
The optional second argument to @code{close} that allows closing one end
of a two-way pipe to a coprocess
(@pxref{Two-way I/O}).

@item
The optional third argument to the @code{match} function
for capturing text-matching subexpressions within a regexp
(@pxref{String Functions}).

@item
Positional specifiers in @code{printf} formats for
making translations easier
(@pxref{Printf Ordering}).

@item
The @code{asort} and @code{asorti} functions for sorting arrays
(@pxref{Array Sorting}).

@item
The @code{bindtextdomain}, @code{dcgettext} and @code{dcngettext} functions
for internationalization
(@pxref{Programmer i18n}).

@item
The @code{extension} built-in function and the ability to add
new built-in functions dynamically
(@pxref{Dynamic Extensions}).

@item
The @code{mktime} built-in function for creating timestamps
(@pxref{Time Functions}).

@item
The
@code{and},
@code{or},
@code{xor},
@code{compl},
@code{lshift},
@code{rshift},
and
@code{strtonum} built-in
functions
(@pxref{Bitwise Functions}).

@item
@cindex @code{next file} statement
The support for @samp{next file} as two words was removed completely
(@pxref{Nextfile Statement}).

@item
The @option{--dump-variables} option to print a list of all global variables
(@pxref{Options}).

@item
The @option{--gen-po} command-line option and the use of a leading
underscore to mark strings that should be translated
(@pxref{String Extraction}).

@item
The @option{--non-decimal-data} option to allow non-decimal
input data
(@pxref{Nondecimal Data}).

@item
The @option{--profile} option and @command{pgawk}, the
profiling version of @command{gawk}, for producing execution
profiles of @command{awk} programs
(@pxref{Profiling}).

@item
The @option{--enable-portals} configuration option to enable special treatment of
pathnames that begin with @file{/p} as BSD portals
(@pxref{Portal Files}).

@item
The use of GNU Automake to help in standardizing the configuration process
(@pxref{Quick Installation}).

@item
The use of GNU @code{gettext} for @command{gawk}'s own message output
(@pxref{Gawk I18N}).

@item
BeOS support
(@pxref{BeOS Installation}).

@item
Tandem support
(@pxref{Tandem Installation}).

@item
The Atari port became officially unsupported
(@pxref{Atari Installation}).

@item
The source code now uses new-style function definitions, with
@command{ansi2knr} to convert the code on systems with old compilers.

@item
The @option{--disable-lint} configuration option to disable lint checking
at compile time
(@pxref{Additional Configuration Options}).

@end itemize

@c XXX ADD MORE STUFF HERE

@c ENDOFRANGE fripls
@c ENDOFRANGE exgnot
@c ENDOFRANGE posnot

@node Contributors
@appendixsec Major Contributors to @command{gawk}
@cindex @command{gawk}, list of contributors to
@quotation
@i{Always give credit where credit is due.}@*
Anonymous
@end quotation

This @value{SECTION} names the major contributors to @command{gawk}
and/or this @value{DOCUMENT}, in approximate chronological order:

@itemize @bullet
@item
@cindex Aho, Alfred
@cindex Weinberger, Peter
@cindex Kernighan, Brian
Dr.@: Alfred V.@: Aho,
Dr.@: Peter J.@: Weinberger, and
Dr.@: Brian W.@: Kernighan, all of Bell Laboratories,
designed and implemented Unix @command{awk},
from which @command{gawk} gets the majority of its feature set.

@item
@cindex Rubin, Paul
Paul Rubin
did the initial design and implementation in 1986, and wrote
the first draft (around 40 pages) of this @value{DOCUMENT}.

@item
@cindex Fenlason, Jay
Jay Fenlason
finished the initial implementation.

@item
@cindex Close, Diane
Diane Close
revised the first draft of this @value{DOCUMENT}, bringing it
to around 90 pages.

@item
@cindex Stallman, Richard
Richard Stallman
helped finish the implementation and the initial draft of this
@value{DOCUMENT}.
He is also the founder of the FSF and the GNU project.

@item
@cindex Woods, John
John Woods
contributed parts of the code (mostly fixes) in
the initial version of @command{gawk}.

@item
@cindex Trueman, David
In 1988,
David Trueman
took over primary maintenance of @command{gawk},
making it compatible with ``new'' @command{awk}, and
greatly improving its performance.

@item
@cindex Rankin, Pat
Pat Rankin
provided the VMS port and its documentation.

@item
@cindex Kwok, Conrad
@cindex Garfinkle, Scott
@cindex Williams, Kent
Conrad Kwok,
Scott Garfinkle,
and
Kent Williams
did the initial ports to MS-DOS with various versions of MSC.

@item
@cindex Peterson, Hal
Hal Peterson
provided help in porting @command{gawk} to Cray systems.

@item
@cindex Rommel, Kai Uwe
Kai Uwe Rommel
provided the initial port to OS/2 and its documentation.

@item
@cindex Jaegermann, Michal
Michal Jaegermann
provided the port to Atari systems and its documentation.
He continues to provide portability checking with DEC Alpha
systems, and has done a lot of work to make sure @command{gawk}
works on non-32-bit systems.

@item
@cindex Fish, Fred
Fred Fish
provided the port to Amiga systems and its documentation.

@item
@cindex Deifik, Scott
Scott Deifik
currently maintains the MS-DOS port.

@item
@cindex Grigera, Juan
Juan Grigera
maintains the port to Windows32 systems.

@item
@cindex Hankerson, Darrel
Dr.@: Darrel Hankerson
acts as coordinator for the various ports to different PC platforms
and creates binary distributions for various PC operating systems.
He is also instrumental in keeping the documentation up to date for
the various PC platforms.

@item
@cindex Zoulas, Christos
Christos Zoulas
provided the @code{extension}
built-in function for dynamically adding new modules.

@item
@cindex Kahrs, J@"urgen
J@"urgen Kahrs
contributed the initial version of the TCP/IP networking
code and documentation, and motivated the inclusion of the @samp{|&} operator.

@item
@cindex Davies, Stephen
Stephen Davies
provided the port to Tandem systems and its documentation.

@item
@cindex Brown, Martin
Martin Brown
provided the port to BeOS and its documentation.

@item
@cindex Peters, Arno
Arno Peters
did the initial work to convert @command{gawk} to use
GNU Automake and @code{gettext}.

@item
@cindex Broder, Alan J.@:
Alan J.@: Broder
provided the initial version of the @code{asort} function
as well as the code for the new optional third argument to the @code{match} function.

@item
@cindex Buening, Andreas
Andreas Buening
updated the @command{gawk} port for OS/2.

@item
@cindex Hasegawa, Isamu
Isamu Hasegawa,
of IBM in Japan, contributed support for multibyte characters.

@item
@cindex Benzinger, Michael
Michael Benzinger contributed the initial code for @code{switch} statements.

@item
@cindex McPhee, Patrick
Patrick T.J.@: McPhee contributed the code for dynamic loading in Windows32
environments.

@item
@cindex Robbins, Arnold
Arnold Robbins
has been working on @command{gawk} since 1988, at first
helping David Trueman, and as the primary maintainer since around 1994.
@end itemize

@node Installation
@appendix Installing @command{gawk}

@c last two commas are part of see also
@cindex operating systems, See Also GNU/Linux, PC operating systems, Unix
@c STARTOFRANGE gligawk
@cindex @command{gawk}, installing
@c STARTOFRANGE ingawk
@cindex installing @command{gawk}
This appendix provides instructions for installing @command{gawk} on the
various platforms that are supported by the developers. The primary
developer supports GNU/Linux (and Unix), whereas the other ports are
contributed.
@xref{Bugs},
for the electronic mail addresses of the people who did
the respective ports.

@menu
* Gawk Distribution::           What is in the @command{gawk} distribution.
* Unix Installation::           Installing @command{gawk} under various
                                versions of Unix.
* Non-Unix Installation::       Installation on Other Operating Systems.
* Unsupported::                 Systems whose ports are no longer supported.
* Bugs::                        Reporting Problems and Bugs.
* Other Versions::              Other freely available @command{awk}
                                implementations.
@end menu

@node Gawk Distribution
@appendixsec The @command{gawk} Distribution
@cindex source code, @command{gawk}

This @value{SECTION} describes how to get the @command{gawk}
distribution, how to extract it, and then what is in the various files and
subdirectories.

@menu
* Getting::                     How to get the distribution.
* Extracting::                  How to extract the distribution.
* Distribution contents::       What is in the distribution.
@end menu

@node Getting
@appendixsubsec Getting the @command{gawk} Distribution
@c last comma is part of secondary
@cindex @command{gawk}, source code, obtaining
There are three ways to get GNU software:

@itemize @bullet
@item
Copy it from someone else who already has it.

@cindex FSF (Free Software Foundation)
@cindex Free Software Foundation (FSF)
@item
Order @command{gawk} directly from the Free Software Foundation.
Software distributions are available for
GNU/Linux, Unix, and MS-Windows, in several CD packages.
Their address is:

@display
Free Software Foundation
59 Temple Place, Suite 330
Boston, MA 02111-1307 USA
Phone: +1-617-542-5942
Fax (including Japan): +1-617-542-2652
Email: @email{gnu@@gnu.org}
URL: @uref{http://www.gnu.org}
@end display

@noindent
Ordering from the FSF directly contributes to the support of the foundation
and to the production of more free software.

@item
Retrieve @command{gawk} by using anonymous @command{ftp} to the Internet host
@code{ftp.gnu.org}, in the directory @file{/gnu/gawk}.
@end itemize

The GNU software archive is mirrored around the world.
The up-to-date list of mirror sites is available from
@uref{http://www.gnu.org/order/ftp.html, the main FSF web site}.
Try to use one of the mirrors; they
will be less busy, and you can usually find one closer to your site.

@node Extracting
@appendixsubsec Extracting the Distribution
@command{gawk} is distributed as a @code{tar} file compressed with the
GNU Zip program, @code{gzip}.

Once you have the distribution (for example,
@file{gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz}),
use @code{gzip} to expand the
file and then use @code{tar} to extract it. You can use the following
pipeline to produce the @command{gawk} distribution:

@example
# Under System V, add 'o' to the tar options
gzip -d -c gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz | tar -xvpf -
@end example

@noindent
This creates a directory named @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}
in the current directory.

The distribution @value{FN} is of the form
@file{gawk-@var{V}.@var{R}.@var{P}.tar.gz}.
The @var{V} represents the major version of @command{gawk},
the @var{R} represents the current release of version @var{V}, and
the @var{P} represents a @dfn{patch level}, meaning that minor bugs have
been fixed in the release. The current patch level is @value{PATCHLEVEL},
but when retrieving distributions, you should get the version with the highest
version, release, and patch level. (Note, however, that patch levels greater than
or equal to 80 denote ``beta'' or nonproduction software; you might not want
to retrieve such a version unless you don't mind experimenting.)
If you are not on a Unix system, you need to make other arrangements
for getting and extracting the @command{gawk} distribution. You should consult
a local expert.

@node Distribution contents
@appendixsubsec Contents of the @command{gawk} Distribution
@c STARTOFRANGE gawdis
@cindex @command{gawk}, distribution

The @command{gawk} distribution has a number of C source files,
documentation files,
subdirectories, and files related to the configuration process
(@pxref{Unix Installation}),
as well as several subdirectories related to different non-Unix
operating systems:

@table @asis
@item Various @samp{.c}, @samp{.y}, and @samp{.h} files
The actual @command{gawk} source code.
@end table

@table @file
@item README
@itemx README_d/README.*
Descriptive files: @file{README} for @command{gawk} under Unix and the
rest for the various hardware and software combinations.

@item INSTALL
A file providing an overview of the configuration and installation process.

@item ChangeLog
A detailed list of source code changes as bugs are fixed or improvements made.

@item NEWS
A list of changes to @command{gawk} since the last release or patch.

@item COPYING
The GNU General Public License.

@item FUTURES
A brief list of features and changes being contemplated for future
releases, with some indication of the time frame for the feature, based
on its difficulty.

@item LIMITATIONS
A list of those factors that limit @command{gawk}'s performance.
Most of these depend on the hardware or operating system software and
are not limits in @command{gawk} itself.

@item POSIX.STD
A description of one area in which the POSIX standard for @command{awk} is
incorrect as well as how @command{gawk} handles the problem.

@c comma is part of primary
@cindex artificial intelligence, @command{gawk} and
@item doc/awkforai.txt
A short article describing why @command{gawk} is a good language for
AI (Artificial Intelligence) programming.

@item doc/README.card
@itemx doc/ad.block
@itemx doc/awkcard.in
@itemx doc/cardfonts
@itemx doc/colors
@itemx doc/macros
@itemx doc/no.colors
@itemx doc/setter.outline
The @command{troff} source for a five-color @command{awk} reference card.
A modern version of @command{troff} such as GNU @command{troff} (@command{groff}) is
needed to produce the color version. See the file @file{README.card}
for instructions if you have an older @command{troff}.

@item doc/gawk.1
The @command{troff} source for a manual page describing @command{gawk}.
This is distributed for the convenience of Unix users.

@cindex Texinfo
@item doc/gawk.texi
The Texinfo source file for this @value{DOCUMENT}.
It should be processed with @TeX{} to produce a printed document, and
with @command{makeinfo} to produce an Info or HTML file.

@item doc/awk.info
The generated Info file for this @value{DOCUMENT}.

@item doc/gawkinet.texi
The Texinfo source file for
@ifinfo
@xref{Top}.
@end ifinfo
@ifnotinfo
@cite{TCP/IP Internetworking with @command{gawk}}.
@end ifnotinfo
It should be processed with @TeX{} to produce a printed document and
with @command{makeinfo} to produce an Info or HTML file.

@item doc/gawkinet.info
The generated Info file for
@cite{TCP/IP Internetworking with @command{gawk}}.

@item doc/igawk.1
The @command{troff} source for a manual page describing the @command{igawk}
program presented in
@ref{Igawk Program}.

@item doc/Makefile.in
The input file used during the configuration process to generate the
actual @file{Makefile} for creating the documentation.

@item Makefile.am
@itemx */Makefile.am
Files used by the GNU @command{automake} software for generating
the @file{Makefile.in} files used by @command{autoconf} and
@command{configure}.

@item Makefile.in
@itemx acconfig.h
@itemx acinclude.m4
@itemx aclocal.m4
@itemx configh.in
@itemx configure.in
@itemx configure
@itemx custom.h
@itemx missing_d/*
@itemx m4/*
These files and subdirectories are used when configuring @command{gawk}
for various Unix systems. They are explained in
@ref{Unix Installation}.

@item intl/*
@itemx po/*
The @file{intl} directory provides the GNU @code{gettext} library, which implements
@command{gawk}'s internationalization features, while the @file{po} directory
contains message translations.

@item awklib/extract.awk
@itemx awklib/Makefile.am
@itemx awklib/Makefile.in
@itemx awklib/eg/*
The @file{awklib} directory contains a copy of @file{extract.awk}
(@pxref{Extract Program}),
which can be used to extract the sample programs from the Texinfo
source file for this @value{DOCUMENT}.
It also contains a @file{Makefile.in} file, which
@command{configure} uses to generate a @file{Makefile}.
@file{Makefile.am} is used by GNU Automake to create @file{Makefile.in}.
The library functions from
@ref{Library Functions},
and the @command{igawk} program from
@ref{Igawk Program},
are included as ready-to-use files in the @command{gawk} distribution.
They are installed as part of the installation process.
The rest of the programs in this @value{DOCUMENT} are available in appropriate
subdirectories of @file{awklib/eg}.

@item unsupported/atari/*
Files needed for building @command{gawk} on an Atari ST
(@pxref{Atari Installation}, for details).

@item unsupported/tandem/*
Files needed for building @command{gawk} on a Tandem
(@pxref{Tandem Installation}, for details).

@item posix/*
Files needed for building @command{gawk} on POSIX-compliant systems.

@item pc/*
Files needed for building @command{gawk} under MS-DOS, MS Windows and OS/2
(@pxref{PC Installation}, for details).

@item vms/*
Files needed for building @command{gawk} under VMS
(@pxref{VMS Installation}, for details).

@item test/*
A test suite for
@command{gawk}. You can use @samp{make check} from the top-level @command{gawk}
directory to run your version of @command{gawk} against the test suite.
If @command{gawk} successfully passes @samp{make check}, then you can
be confident of a successful port.
@end table
@c ENDOFRANGE gawdis

@node Unix Installation
@appendixsec Compiling and Installing @command{gawk} on Unix

Usually, you can compile and install @command{gawk} by typing only two
commands. However, if you use an unusual system, you may need
to configure @command{gawk} for your system yourself.

@menu
* Quick Installation::               Compiling @command{gawk} under Unix.
* Additional Configuration Options:: Other compile-time options.
* Configuration Philosophy::         How it's all supposed to work.
@end menu

@node Quick Installation
@appendixsubsec Compiling @command{gawk} for Unix

@c @cindex installation, unix
After you have extracted the @command{gawk} distribution, @command{cd}
to @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}. Like most GNU software,
@command{gawk} is configured
automatically for your Unix system by running the @command{configure} program.
This program is a Bourne shell script that is generated automatically using
GNU @command{autoconf}.
@ifnotinfo
(The @command{autoconf} software is
described fully in
@cite{Autoconf---Generating Automatic Configuration Scripts},
which is available from the Free Software Foundation.)
@end ifnotinfo
@ifinfo
(The @command{autoconf} software is described fully starting with
@ref{Top}.)
@end ifinfo

To configure @command{gawk}, simply run @command{configure}:

@example
sh ./configure
@end example

This produces a @file{Makefile} and @file{config.h} tailored to your system.
The @file{config.h} file describes various facts about your system.
You might want to edit the @file{Makefile} to
change the @code{CFLAGS} variable, which controls
the command-line options that are passed to the C compiler (such as
optimization levels or compiling for debugging).

Alternatively, you can add your own values for most @command{make}
variables on the command line, such as @code{CC} and @code{CFLAGS}, when
running @command{configure}:

@example
CC=cc CFLAGS=-g sh ./configure
@end example

@noindent
See the file @file{INSTALL} in the @command{gawk} distribution for
all the details.
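
If a value contains more than one word, quote it so that the shell
treats it as a single assignment. The following is only a sketch: it
assumes that GCC is installed as @command{gcc}, and the flags shown are
illustrative rather than recommended:

@example
# Assumption: gcc is on the search path; adjust the flags to taste.
CC=gcc CFLAGS="-O2 -g" sh ./configure
@end example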

After you have run @command{configure} and possibly edited the @file{Makefile},
type:

@example
make
@end example

@noindent
Shortly thereafter, you should have an executable version of @command{gawk}.
That's all there is to it!
To verify that @command{gawk} is working properly,
run @samp{make check}. All of the tests should succeed.
If these steps do not work, or if any of the tests fail,
check the files in the @file{README_d} directory to see if you've
found a known problem. If the failure is not described there,
please send in a bug report
(@pxref{Bugs}).

@node Additional Configuration Options
@appendixsubsec Additional Configuration Options
@cindex @command{gawk}, configuring, options
@c comma is part of primary
@cindex configuration options, @command{gawk}

There are several additional options you may use on the @command{configure}
command line when compiling @command{gawk} from scratch, including:

@table @code
@cindex @code{--enable-portals} configuration option
@cindex configuration option, @code{--enable-portals}
@item --enable-portals
Treat pathnames that begin
with @file{/p} as BSD portal files when doing two-way I/O with
the @samp{|&} operator
(@pxref{Portal Files}).

@cindex @code{--enable-switch} configuration option
@cindex configuration option, @code{--enable-switch}
@item --enable-switch
Enable the recognition and execution of C-style @code{switch} statements
in @command{awk} programs
(@pxref{Switch Statement}).

@cindex Linux
@cindex GNU/Linux
@cindex @code{--with-included-gettext} configuration option
@cindex @code{--with-included-gettext} configuration option, configuring @command{gawk} with
@cindex configuration option, @code{--with-included-gettext}
@item --with-included-gettext
Use the version of the @code{gettext} library that comes with @command{gawk}.
This option should be used on systems that do @emph{not} use @value{PVERSION} 2 (or later)
of the GNU C library.
All known modern GNU/Linux systems use Glibc 2. Use this option on any other system.

@cindex @code{--disable-lint} configuration option
@cindex configuration option, @code{--disable-lint}
@item --disable-lint
This option disables all lint checking within @command{gawk}. The
@option{--lint} and @option{--lint-old} options
(@pxref{Options})
are accepted, but silently do nothing.
Similarly, setting the @code{LINT} variable
(@pxref{User-modified})
has no effect on the running @command{awk} program.

When used with GCC's automatic dead-code-elimination, this option
cuts almost 200K bytes off the size of the @command{gawk}
executable on GNU/Linux x86 systems. Results on other systems and
with other compilers are likely to vary.
Using this option may bring you some slight performance improvement.

Using this option will cause some of the tests in the test suite
to fail. This option may be removed at a later date.

@cindex @code{--disable-nls} configuration option
@cindex configuration option, @code{--disable-nls}
@item --disable-nls
Disable all message-translation facilities.
This is usually not desirable, but it may bring you some slight performance
improvement.
You should also use this option if @option{--with-included-gettext}
doesn't work on your system.
23656@end table 23657 23658@node Configuration Philosophy 23659@appendixsubsec The Configuration Process 23660 23661@cindex @command{gawk}, configuring 23662This @value{SECTION} is of interest only if you know something about using the 23663C language and the Unix operating system. 23664 23665The source code for @command{gawk} generally attempts to adhere to formal 23666standards wherever possible. This means that @command{gawk} uses library 23667routines that are specified by the ISO C standard and by the POSIX 23668operating system interface standard. When using an ISO C compiler, 23669function prototypes are used to help improve the compile-time checking. 23670 23671Many Unix systems do not support all of either the ISO or the 23672POSIX standards. The @file{missing_d} subdirectory in the @command{gawk} 23673distribution contains replacement versions of those functions that are 23674most likely to be missing. 23675 23676The @file{config.h} file that @command{configure} creates contains 23677definitions that describe features of the particular operating system 23678where you are attempting to compile @command{gawk}. The three things 23679described by this file are: what header files are available, so that 23680they can be correctly included, what (supposedly) standard functions 23681are actually available in your C libraries, and various miscellaneous 23682facts about your variant of Unix. For example, there may not be an 23683@code{st_blksize} element in the @code{stat} structure. In this case, 23684@samp{HAVE_ST_BLKSIZE} is undefined. 23685 23686@cindex @code{custom.h} file 23687It is possible for your C compiler to lie to @command{configure}. It may 23688do so by not exiting with an error when a library function is not 23689available. To get around this, edit the file @file{custom.h}. 
23690Use an @samp{#ifdef} that is appropriate for your system, and either 23691@code{#define} any constants that @command{configure} should have defined but 23692didn't, or @code{#undef} any constants that @command{configure} defined and 23693should not have. @file{custom.h} is automatically included by 23694@file{config.h}. 23695 23696It is also possible that the @command{configure} program generated by 23697@command{autoconf} will not work on your system in some other fashion. 23698If you do have a problem, the file @file{configure.in} is the input for 23699@command{autoconf}. You may be able to change this file and generate a 23700new version of @command{configure} that works on your system 23701(@pxref{Bugs}, 23702for information on how to report problems in configuring @command{gawk}). 23703The same mechanism may be used to send in updates to @file{configure.in} 23704and/or @file{custom.h}. 23705 23706@node Non-Unix Installation 23707@appendixsec Installation on Other Operating Systems 23708 23709This @value{SECTION} describes how to install @command{gawk} on 23710various non-Unix systems. 23711 23712@menu 23713* Amiga Installation:: Installing @command{gawk} on an Amiga. 23714* BeOS Installation:: Installing @command{gawk} on BeOS. 23715* PC Installation:: Installing and Compiling @command{gawk} on 23716 MS-DOS and OS/2. 23717* VMS Installation:: Installing @command{gawk} on VMS. 23718@end menu 23719 23720@node Amiga Installation 23721@appendixsubsec Installing @command{gawk} on an Amiga 23722 23723@cindex amiga 23724@cindex installation, amiga 23725You can install @command{gawk} on an Amiga system using a Unix emulation 23726environment, available via anonymous @command{ftp} from 23727@code{ftp.ninemoons.com} in the directory @file{pub/ade/current}. 23728This includes a shell based on @command{pdksh}. The primary component of 23729this environment is a Unix emulation library, @file{ixemul.lib}. 
23730@c could really use more background here, who wrote this, etc. 23731 23732A more complete distribution for the Amiga is available on 23733the Geek Gadgets CD-ROM, available from: 23734 23735@display 23736CRONUS 237371840 E. Warner Road #105-265 23738Tempe, AZ 85284 USA 23739US Toll Free: (800) 804-0833 23740Phone: +1-602-491-0442 23741FAX: +1-602-491-0048 23742Email: @email{info@@ninemoons.com} 23743WWW: @uref{http://www.ninemoons.com} 23744Anonymous @command{ftp} site: @code{ftp.ninemoons.com} 23745@end display 23746 23747Once you have the distribution, you can configure @command{gawk} simply by 23748running @command{configure}: 23749 23750@example 23751configure -v m68k-amigaos 23752@end example 23753 23754Then run @command{make} and you should be all set! 23755If these steps do not work, please send in a bug report 23756(@pxref{Bugs}). 23757 23758@node BeOS Installation 23759@appendixsubsec Installing @command{gawk} on BeOS 23760@cindex BeOS 23761@cindex installation, beos 23762 23763@c From email contributed by Martin Brown, mc@whoever.com 23764Since BeOS DR9, all the tools that you should need to build @code{gawk} are 23765included with BeOS. The process is basically identical to the Unix process 23766of running @command{configure} and then @command{make}. Full instructions are given below. 23767 23768You can compile @command{gawk} under BeOS by extracting the standard sources 23769and running @command{configure}. You @emph{must} specify the location 23770prefix for the installation directory. For BeOS DR9 and beyond, the best directory to 23771use is @file{/boot/home/config}, so the @command{configure} command is: 23772 23773@example 23774configure --prefix=/boot/home/config 23775@end example 23776 23777This installs the compiled application into @file{/boot/home/config/bin}, 23778which is already specified in the standard @env{PATH}. 

Once the configuration process is completed, you can run @command{make},
and then @samp{make install}:

@example
$ make
@dots{}
$ make install
@end example

BeOS uses @command{bash} as its shell; thus, you use @command{gawk} the same way you would
under Unix.
If these steps do not work, please send in a bug report
(@pxref{Bugs}).

@c Rewritten by Scott Deifik <scottd@amgen.com>
@c and Darrel Hankerson <hankedr@mail.auburn.edu>

@node PC Installation
@appendixsubsec Installation on PC Operating Systems

@c first comma is part of primary
@cindex PC operating systems, @command{gawk} on, installing
@c {PC, gawk on} is the secondary term
@cindex operating systems, PC, @command{gawk} on, installing
This @value{SECTION} covers installation and usage of @command{gawk} on x86 machines
running DOS, any version of Windows, or OS/2.
In this @value{SECTION}, the term ``Windows32''
refers to any of Windows-95/98/ME/NT/2000.

The limitations of DOS (and DOS shells under Windows or OS/2) have meant
that various ``DOS extenders'' are often used with programs such as
@command{gawk}. The varying capabilities of Microsoft Windows 3.1
and Windows32 can add to the confusion. For an overview of the
considerations, please refer to @file{README_d/README.pc} in the
distribution.

@menu
* PC Binary Installation::      Installing a prepared distribution.
* PC Compiling::                Compiling @command{gawk} for MS-DOS, Windows32,
                                and OS/2.
* PC Dynamic::                  Compiling @command{gawk} for dynamic libraries.
* PC Using::                    Running @command{gawk} on MS-DOS, Windows32 and
                                OS/2.
* Cygwin::                      Building and running @command{gawk} for
                                Cygwin.
@end menu

@node PC Binary Installation
@appendixsubsubsec Installing a Prepared Distribution for PC Systems

If you have received a binary distribution prepared by the DOS
maintainers, then @command{gawk} and the necessary support files appear
under the @file{gnu} directory, with executables in @file{gnu/bin},
libraries in @file{gnu/lib/awk}, and manual pages under @file{gnu/man}.
This is designed for easy installation to a @file{/gnu} directory on your
drive; however, the files can be installed anywhere provided @env{AWKPATH} is
set properly. Regardless of the installation directory, the first line of
@file{igawk.cmd} and @file{igawk.bat} (in @file{gnu/bin}) may need to be
edited.

The binary distribution contains a separate file describing the
contents. In particular, it may include more than one version of the
@command{gawk} executable.

OS/2 (32 bit, EMX) binary distributions are prepared for the @file{/usr}
directory of your preferred drive. Set @env{UNIXROOT} to your installation
drive (e.g., @samp{e:}) if you want to install @command{gawk} onto a drive
other than the hardcoded default @samp{c:}. Executables appear in @file{/usr/bin},
libraries under @file{/usr/share/awk}, manual pages under @file{/usr/man},
Texinfo documentation under @file{/usr/info}, and NLS files under @file{/usr/share/locale}.
If you already have a file @file{/usr/info/dir} from another package,
@emph{do not overwrite it!} Instead, enter the following commands at your prompt
(replace @samp{x:} with your installation drive):

@example
install-info --info-dir=x:/usr/info x:/usr/info/awk.info
install-info --info-dir=x:/usr/info x:/usr/info/gawkinet.info
@end example

However, the files can be installed anywhere provided @env{AWKPATH} is
set properly.
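
As a sketch of what ``set properly'' can mean, suppose the library files
were unpacked into @file{d:/tools/gawk/share/awk} (a hypothetical location
chosen only for this illustration). At an @command{sh}-like prompt, a
setting along the following lines should let @command{gawk} find them.
Note that on these systems the elements of @env{AWKPATH} are separated
by semicolons, not colons:

@example
$ AWKPATH=".;d:/tools/gawk/share/awk"
$ export AWKPATH
@end example

@noindent
The leading @file{.} keeps the current directory in the search path,
mirroring the default search paths.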
23861 23862The binary distribution may contain a separate file containing additional 23863or more detailed installation instructions. 23864 23865@node PC Compiling 23866@appendixsubsubsec Compiling @command{gawk} for PC Operating Systems 23867 23868@command{gawk} can be compiled for MS-DOS, Windows32, and OS/2 using the GNU 23869development tools from DJ Delorie (DJGPP; MS-DOS only) or Eberhard 23870Mattes (EMX; MS-DOS, Windows32 and OS/2). Microsoft Visual C/C++ can be used 23871to build a Windows32 version, and Microsoft C/C++ can be 23872used to build 16-bit versions for MS-DOS and OS/2. 23873@c FIXME: 23874(As of @command{gawk} 3.1.2, the MSC version doesn't work. However, 23875the maintainer is working on fixing it.) 23876The file 23877@file{README_d/README.pc} in the @command{gawk} distribution contains 23878additional notes, and @file{pc/Makefile} contains important information on 23879compilation options. 23880 23881To build @command{gawk} for MS-DOS, Windows32, and OS/2 (16 bit only; for 32 bit 23882(EMX) you can use the @command{configure} script and skip the following paragraphs; 23883for details see below), copy the files in the @file{pc} directory (@emph{except} 23884for @file{ChangeLog}) to the directory with the rest of the @command{gawk} 23885sources. The @file{Makefile} contains a configuration section with comments and 23886may need to be edited in order to work with your @command{make} utility. 23887 23888The @file{Makefile} contains a number of targets for building various MS-DOS, 23889Windows32, and OS/2 versions. A list of targets is printed if the @command{make} 23890command is given without a target. As an example, to build @command{gawk} 23891using the DJGPP tools, enter @samp{make djgpp}. 23892 23893Using @command{make} to run the standard tests and to install @command{gawk} 23894requires additional Unix-like tools, including @command{sh}, @command{sed}, and 23895@command{cp}. 
In order to run the tests, the @file{test/*.ok} files may need to
be converted so that they have the usual DOS-style end-of-line markers. Most
of the tests work properly with Stewartson's shell along with the
companion utilities or appropriate GNU utilities. However, some editing of
@file{test/Makefile} is required. It is recommended that you copy the file
@file{pc/Makefile.tst} over the file @file{test/Makefile} as a
replacement. Details can be found in @file{README_d/README.pc}
and in the file @file{pc/Makefile.tst}.

The 32 bit EMX version of @command{gawk} works ``out of the box'' under OS/2.
In principle, it is possible to compile @command{gawk} the following way:

@example
$ ./configure
$ make
@end example

This is not recommended, though. To get an OMF executable you should
use the following commands at your @command{sh} prompt:

@example
$ CPPFLAGS="-D__ST_MT_ERRNO__"
$ export CPPFLAGS
$ CFLAGS="-O2 -Zomf -Zmt"
$ export CFLAGS
$ LDFLAGS="-s -Zcrtdll -Zlinker /exepack:2 -Zlinker /pm:vio -Zstack 0x8000"
$ export LDFLAGS
$ RANLIB="echo"
$ export RANLIB
$ ./configure --prefix=c:/usr --without-included-gettext
$ make AR=emxomfar
@end example

These are just suggestions. You may use any other set of (self-consistent)
environment variables and compiler flags.

To get an FHS-compliant file hierarchy, add the
@command{configure} options @option{--infodir=c:/usr/share/info}, @option{--mandir=c:/usr/share/man},
and @option{--libexecdir=c:/usr/lib}.

The internal @code{gettext} library tends to be problematic. It is therefore recommended
either to use an external one (@option{--without-included-gettext}) or to disable
NLS entirely (@option{--disable-nls}).
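
Taken together, the suggestions above might be combined into a single
invocation such as the following. This is only a sketch: it assumes the
environment variables for the OMF build have already been exported as shown
earlier, and it picks @option{--disable-nls} rather than
@option{--without-included-gettext} purely for illustration:

@example
$ ./configure --prefix=c:/usr --infodir=c:/usr/share/info \
>   --mandir=c:/usr/share/man --libexecdir=c:/usr/lib \
>   --disable-nls
$ make AR=emxomfar
@end example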

If you use GCC 2.95 or newer, it is also recommended to set:

@example
$ LIBS="-lgcc"
$ export LIBS
@end example

You can also get an @code{a.out} executable if you prefer:

@example
$ CPPFLAGS="-D__ST_MT_ERRNO__"
$ export CPPFLAGS
$ CFLAGS="-O2 -Zmt"
$ export CFLAGS
$ LDFLAGS="-s -Zstack 0x8000"
$ export LDFLAGS
$ LIBS="-lgcc"
$ export LIBS
$ unset RANLIB
$ ./configure --prefix=c:/usr --without-included-gettext
$ make
@end example

@strong{Note:} Even if the compiled @command{gawk.exe} (@code{a.out}) executable
contains a DOS header, it does @emph{not} work under DOS. To compile an executable
that runs under DOS, @code{"-DPIPES_SIMULATED"} must be added to @env{CPPFLAGS}.
But then some nonstandard extensions of @command{gawk} (e.g., @samp{|&}) do not work!

After compilation the internal tests can be performed. Enter
@samp{make check CMP="diff -a"} at your command prompt. All tests
but the @code{pid} test are expected to work properly. The @code{pid}
test fails because child processes are not started by @code{fork()}.

@samp{make install} works as expected.

@strong{Note:} Most OS/2 ports of GNU @command{make} are not able to handle
the Makefiles of this package. If you encounter any problems with @command{make},
try GNU Make 3.79.1 or a later version. You should find the latest
version on @uref{http://www.unixos2.org/sw/pub/binary/make/} or on
@uref{ftp://hobbes.nmsu.edu/pub/os2/}.

@node PC Dynamic
@appendixsubsubsec Compiling @command{gawk} For Dynamic Libraries

@c From README_d/README.pcdynamic
@c 11 June 2003

To compile @command{gawk} with dynamic extension support,
uncomment the definitions of @code{DYN_FLAGS}, @code{DYN_EXP},
@code{DYN_OBJ}, and @code{DYN_MAKEXP} in the configuration section of
the @file{Makefile}.
There are two definitions for @code{DYN_MAKEXP}: 23988pick the one that matches your target. 23989 23990To build some of the example extension libraries, @command{cd} to the 23991extension directory and copy @file{Makefile.pc} to @file{Makefile}. You 23992can then build using the same two targets. To run the example 23993@command{awk} scripts, you'll need to either change the call to 23994the @code{extension} function to match the name of the library (for 23995instance, change @code{"./ordchr.so"} to @code{"ordchr.dll"} or simply 23996@code{"ordchr"}), or rename the library to match the call (for instance, 23997rename @file{ordchr.dll} to @file{ordchr.so}). 23998 23999If you build @command{gawk.exe} with one compiler but want to build 24000an extension library with the other, you need to copy the import 24001library. Visual C uses a library called @file{gawk.lib}, while MinGW uses 24002a library called @file{libgawk.a}. These files are equivalent and will 24003interoperate if you give them the correct name. The resulting shared 24004libraries are also interoperable. 24005 24006To create your own extension library, you can use the examples as models, 24007but you're essentially on your own. Post to @code{comp.lang.awk} or 24008send electronic mail to @email{ptjm@@interlog.com} if you have problems getting 24009started. If you need to access functions or variables which are not 24010exported by @command{gawk.exe}, add them to @file{gawkw32.def} and 24011rebuild. You should also add @code{ATTRIBUTE_EXPORTED} to the declaration 24012in @file{awk.h} of any variables you add to @file{gawkw32.def}. 24013 24014Note that extension libraries have the name of the @command{awk} 24015executable embedded in them at link time, so they will work only 24016with @command{gawk.exe}. In particular, they won't work if you 24017rename @command{gawk.exe} to @command{awk.exe} or if you try to use 24018@command{pgawk.exe}. 
You can perform profiling by temporarily renaming
@command{pgawk.exe} to @command{gawk.exe}. You can also resolve this problem
by changing the program name in the definition of @code{DYN_MAKEXP}
for your compiler.

On Windows32, libraries are sought first in the current directory, then in
the directory containing @command{gawk.exe}, and finally through the
@env{PATH} environment variable.

@node PC Using
@appendixsubsubsec Using @command{gawk} on PC Operating Systems
@c STARTOFRANGE opgawx
@cindex operating systems, PC, @command{gawk} on
@c STARTOFRANGE pcgawon
@cindex PC operating systems, @command{gawk} on

With the exception of the Cygwin environment,
the @samp{|&} operator and TCP/IP networking
(@pxref{TCP/IP Networking})
are not supported for MS-DOS or MS-Windows. EMX (OS/2 only) does support
at least the @samp{|&} operator.

@cindex search paths
@cindex @command{gawk}, OS/2 version of
@cindex @command{gawk}, MS-DOS version of
@cindex @code{;} (semicolon), @code{AWKPATH} variable and
@cindex semicolon (@code{;}), @code{AWKPATH} variable and
@cindex @code{AWKPATH} environment variable
The OS/2 and MS-DOS versions of @command{gawk} search for program files as
described in @ref{AWKPATH Variable}.
However, semicolons (rather than colons) separate elements
in the @env{AWKPATH} variable. If @env{AWKPATH} is not set or is empty,
then the default search path for OS/2 (16 bit) and MS-DOS versions is
@code{@w{".;c:/lib/awk;c:/gnu/lib/awk"}}.

The search path for OS/2 (32 bit, EMX) is determined by the prefix directory
(most likely @file{/usr} or @file{c:/usr}) that has been specified as an option of
the @command{configure} script, as is the case for the Unix versions.
24056If @file{c:/usr} is the prefix directory then the default search path contains @file{.} 24057and @file{c:/usr/share/awk}. 24058Additionally, to support binary distributions of @command{gawk} for OS/2 24059systems whose drive @samp{c:} might not support long file names or might not exist 24060at all, there is a special environment variable. If @env{UNIXROOT} specifies 24061a drive then this specific drive is also searched for program files. 24062E.g., if @env{UNIXROOT} is set to @file{e:} the complete default search path is 24063@code{@w{".;c:/usr/share/awk;e:/usr/share/awk"}}. 24064 24065An @command{sh}-like shell (as opposed to @command{command.com} under MS-DOS 24066or @command{cmd.exe} under OS/2) may be useful for @command{awk} programming. 24067Ian Stewartson has written an excellent shell for MS-DOS and OS/2, 24068Daisuke Aoyama has ported GNU @command{bash} to MS-DOS using the DJGPP tools, 24069and several shells are available for OS/2, including @command{ksh}. The file 24070@file{README_d/README.pc} in the @command{gawk} distribution contains 24071information on these shells. Users of Stewartson's shell on DOS should 24072examine its documentation for handling command lines; in particular, 24073the setting for @command{gawk} in the shell configuration may need to be 24074changed and the @code{ignoretype} option may also be of interest. 24075 24076@cindex differences in @command{awk} and @command{gawk}, @code{BINMODE} variable 24077@cindex @code{BINMODE} variable 24078Under OS/2 and DOS, @command{gawk} (and many other text programs) silently 24079translate end-of-line @code{"\r\n"} to @code{"\n"} on input and @code{"\n"} 24080to @code{"\r\n"} on output. A special @code{BINMODE} variable allows 24081control over these translations and is interpreted as follows: 24082 24083@itemize @bullet 24084@item 24085If @code{BINMODE} is @samp{"r"}, or 24086@code{(BINMODE & 1)} is nonzero, then 24087binary mode is set on read (i.e., no translations on reads). 

@item
If @code{BINMODE} is @code{"w"}, or
@code{(BINMODE & 2)} is nonzero, then
binary mode is set on write (i.e., no translations on writes).

@item
If @code{BINMODE} is @code{"rw"} or @code{"wr"},
binary mode is set for both read and write
(same as @code{(BINMODE & 3)}).

@item
@code{BINMODE=@var{non-null-string}} is
the same as @samp{BINMODE=3} (i.e., no translations on
reads or writes). However, @command{gawk} issues a warning
message if the string is not one of @code{"rw"} or @code{"wr"}.
@end itemize

@noindent
The modes for standard input and standard output are set one time
only (after the
command line is read, but before processing any of the @command{awk} program).
Setting @code{BINMODE} for standard input or
standard output is accomplished by using an
appropriate @samp{-v BINMODE=@var{N}} option on the command line.
@code{BINMODE} is set at the time a file or pipe is opened and cannot be
changed mid-stream.

The name @code{BINMODE} was chosen to match @command{mawk}
(@pxref{Other Versions}).
Both @command{mawk} and @command{gawk} handle @code{BINMODE} similarly; however,
@command{mawk} adds a @samp{-W BINMODE=@var{N}} option and an environment
variable that can set @code{BINMODE}, @code{RS}, and @code{ORS}. The
files @file{binmode[1-3].awk} (under @file{gnu/lib/awk} in some of the
prepared distributions) have been chosen to match @command{mawk}'s
@samp{-W BINMODE=@var{N}} option. These can be changed or discarded; in particular,
the setting of @code{RS} giving the fewest ``surprises'' is open to debate.
@command{mawk} uses @samp{RS = "\r\n"} if binary mode is set on read, which is
appropriate for files with the DOS-style end-of-line.
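
The numeric encoding can be pictured as two independent bits. The following
short program is only an illustrative sketch (it is not part of the
distribution, and the variable name @code{mode} is arbitrary); it uses
@command{gawk}'s @code{and} function to show which translations each
numeric value turns off:

@example
BEGIN @{
    # value 1 selects binary reads, value 2 selects binary writes
    for (mode = 0; mode <= 3; mode++)
        printf "BINMODE=%d: read %s, write %s\n", mode,
               (and(mode, 1) ? "binary" : "text"),
               (and(mode, 2) ? "binary" : "text")
@}
@end example

@noindent
For @samp{mode} equal to 2, for example, it prints
@samp{BINMODE=2: read text, write binary}.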
24127 24128To illustrate, the following examples set binary mode on writes for standard 24129output and other files, and set @code{ORS} as the ``usual'' DOS-style 24130end-of-line: 24131 24132@example 24133gawk -v BINMODE=2 -v ORS="\r\n" @dots{} 24134@end example 24135 24136@noindent 24137or: 24138 24139@example 24140gawk -v BINMODE=w -f binmode2.awk @dots{} 24141@end example 24142 24143@noindent 24144These give the same result as the @samp{-W BINMODE=2} option in 24145@command{mawk}. 24146The following changes the record separator to @code{"\r\n"} and sets binary 24147mode on reads, but does not affect the mode on standard input: 24148 24149@example 24150gawk -v RS="\r\n" --source "BEGIN @{ BINMODE = 1 @}" @dots{} 24151@end example 24152 24153@noindent 24154or: 24155 24156@example 24157gawk -f binmode1.awk @dots{} 24158@end example 24159 24160@noindent 24161With proper quoting, in the first example the setting of @code{RS} can be 24162moved into the @code{BEGIN} rule. 24163 24164@node Cygwin 24165@appendixsubsubsec Using @command{gawk} In The Cygwin Environment 24166 24167@command{gawk} can be used ``out of the box'' under Windows if you are 24168using the Cygwin environment.@footnote{@uref{http://www.cygwin.com}} 24169This environment provides an excellent simulation of Unix, using the 24170GNU tools, such as @command{bash}, the GNU Compiler Collection (GCC), 24171GNU Make, and other GNU tools. Compilation and installation for Cygwin 24172is the same as for a Unix system: 24173 24174@example 24175tar -xvpzf gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz 24176cd gawk-@value{VERSION}.@value{PATCHLEVEL} 24177./configure 24178make 24179@end example 24180 24181When compared to GNU/Linux on the same system, the @samp{configure} 24182step on Cygwin takes considerably longer. However, it does finish, 24183and then the @samp{make} proceeds as usual. 
24184 24185@strong{Note:} The @samp{|&} operator and TCP/IP networking 24186(@pxref{TCP/IP Networking}) 24187are fully supported in the Cygwin environment. This is not true 24188for any other environment for MS-DOS or MS-Windows. 24189 24190@node VMS Installation 24191@appendixsubsec How to Compile and Install @command{gawk} on VMS 24192 24193@c based on material from Pat Rankin <rankin@eql.caltech.edu> 24194@c now rankin@pactechdata.com 24195 24196@cindex installation, vms 24197This @value{SUBSECTION} describes how to compile and install @command{gawk} under VMS. 24198 24199@menu 24200* VMS Compilation:: How to compile @command{gawk} under VMS. 24201* VMS Installation Details:: How to install @command{gawk} under VMS. 24202* VMS Running:: How to run @command{gawk} under VMS. 24203* VMS POSIX:: Alternate instructions for VMS POSIX. 24204@end menu 24205 24206@node VMS Compilation 24207@appendixsubsubsec Compiling @command{gawk} on VMS 24208 24209To compile @command{gawk} under VMS, there is a @code{DCL} command procedure that 24210issues all the necessary @code{CC} and @code{LINK} commands. There is 24211also a @file{Makefile} for use with the @code{MMS} utility. From the source 24212directory, use either: 24213 24214@example 24215$ @@[.VMS]VMSBUILD.COM 24216@end example 24217 24218@noindent 24219or: 24220 24221@example 24222$ MMS/DESCRIPTION=[.VMS]DESCRIP.MMS GAWK 24223@end example 24224 24225Depending upon which C compiler you are using, follow one of the sets 24226of instructions in this table: 24227 24228@table @asis 24229@item VAX C V3.x 24230Use either @file{vmsbuild.com} or @file{descrip.mms} as is. These use 24231@code{CC/OPTIMIZE=NOLINE}, which is essential for Version 3.0. 24232 24233@item VAX C V2.x 24234You must have Version 2.3 or 2.4; older ones won't work. Edit either 24235@file{vmsbuild.com} or @file{descrip.mms} according to the comments in them. 24236For @file{vmsbuild.com}, this just entails removing two @samp{!} delimiters. 
24237Also edit @file{config.h} (which is a copy of file @file{[.config]vms-conf.h}) 24238and comment out or delete the two lines @samp{#define __STDC__ 0} and 24239@samp{#define VAXC_BUILTINS} near the end. 24240 24241@item GNU C 24242Edit @file{vmsbuild.com} or @file{descrip.mms}; the changes are different 24243from those for VAX C V2.x but equally straightforward. No changes to 24244@file{config.h} are needed. 24245 24246@item DEC C 24247Edit @file{vmsbuild.com} or @file{descrip.mms} according to their comments. 24248No changes to @file{config.h} are needed. 24249@end table 24250 24251@command{gawk} has been tested under VAX/VMS 5.5-1 using VAX C V3.2, and 24252GNU C 1.40 and 2.3. It should work without modifications for VMS V4.6 and up. 24253 24254@node VMS Installation Details 24255@appendixsubsubsec Installing @command{gawk} on VMS 24256 24257To install @command{gawk}, all you need is a ``foreign'' command, which is 24258a @code{DCL} symbol whose value begins with a dollar sign. For example: 24259 24260@example 24261$ GAWK :== $disk1:[gnubin]GAWK 24262@end example 24263 24264@noindent 24265Substitute the actual location of @command{gawk.exe} for 24266@samp{$disk1:[gnubin]}. The symbol should be placed in the 24267@file{login.com} of any user who wants to run @command{gawk}, 24268so that it is defined every time the user logs on. 24269Alternatively, the symbol may be placed in the system-wide 24270@file{sylogin.com} procedure, which allows all users 24271to run @command{gawk}. 24272 24273Optionally, the help entry can be loaded into a VMS help library: 24274 24275@example 24276$ LIBRARY/HELP SYS$HELP:HELPLIB [.VMS]GAWK.HLP 24277@end example 24278 24279@noindent 24280(You may want to substitute a site-specific help library rather than 24281the standard VMS library @samp{HELPLIB}.) 
After loading the help text, 24282the command: 24283 24284@example 24285$ HELP GAWK 24286@end example 24287 24288@noindent 24289provides information about both the @command{gawk} implementation and the 24290@command{awk} programming language. 24291 24292The logical name @samp{AWK_LIBRARY} can designate a default location 24293for @command{awk} program files. For the @option{-f} option, if the specified 24294@value{FN} has no device or directory path information in it, @command{gawk} 24295looks in the current directory first, then in the directory specified 24296by the translation of @samp{AWK_LIBRARY} if the file is not found. 24297If, after searching in both directories, the file still is not found, 24298@command{gawk} appends the suffix @samp{.awk} to the filename and retries 24299the file search. If @samp{AWK_LIBRARY} is not defined, that 24300portion of the file search fails benignly. 24301 24302@node VMS Running 24303@appendixsubsubsec Running @command{gawk} on VMS 24304 24305Command-line parsing and quoting conventions are significantly different 24306on VMS, so examples in this @value{DOCUMENT} or from other sources often need minor 24307changes. They @emph{are} minor though, and all @command{awk} programs 24308should run correctly. 24309 24310Here are a couple of trivial tests: 24311 24312@example 24313$ gawk -- "BEGIN @{print ""Hello, World!""@}" 24314$ gawk -"W" version 24315! could also be -"W version" or "-W version" 24316@end example 24317 24318@noindent 24319Note that uppercase and mixed-case text must be quoted. 24320 24321The VMS port of @command{gawk} includes a @code{DCL}-style interface in addition 24322to the original shell-style interface (see the help entry for details). 24323One side effect of dual command-line parsing is that if there is only a 24324single parameter (as in the quoted string program above), the command 24325becomes ambiguous. 
To work around this, the normally optional @option{--} 24326flag is required to force Unix style rather than @code{DCL} parsing. If any 24327other dash-type options (or multiple parameters such as @value{DF}s to 24328process) are present, there is no ambiguity and @option{--} can be omitted. 24329 24330@c @cindex directory search 24331@c @cindex path, search 24332@cindex search paths 24333@cindex search paths, for source files 24334The default search path, when looking for @command{awk} program files specified 24335by the @option{-f} option, is @code{"SYS$DISK:[],AWK_LIBRARY:"}. The logical 24336name @samp{AWKPATH} can be used to override this default. The format 24337of @samp{AWKPATH} is a comma-separated list of directory specifications. 24338When defining it, the value should be quoted so that it retains a single 24339translation and not a multitranslation @code{RMS} searchlist. 24340 24341@node VMS POSIX 24342@appendixsubsubsec Building and Using @command{gawk} on VMS POSIX 24343 24344Ignore the instructions above, although @file{vms/gawk.hlp} should still 24345be made available in a help library. The source tree should be unpacked 24346into a container file subsystem rather than into the ordinary VMS filesystem. 24347Make sure that the two scripts, @file{configure} and 24348@file{vms/posix-cc.sh}, are executable; use @samp{chmod +x} on them if 24349necessary. Then execute the following two commands: 24350 24351@example 24352psx> CC=vms/posix-cc.sh configure 24353psx> make CC=c89 gawk 24354@end example 24355 24356@noindent 24357The first command constructs files @file{config.h} and @file{Makefile} out 24358of templates, using a script to make the C compiler fit @command{configure}'s 24359expectations. The second command compiles and links @command{gawk} using 24360the C compiler directly; ignore any warnings from @command{make} about being 24361unable to redefine @code{CC}. 
@command{configure} takes a very long
time to execute, but at least it provides incremental feedback as it runs.

This has been tested with VAX/VMS V6.2, VMS POSIX V2.0, and DEC C V5.2.

Once built, @command{gawk} works like any other shell utility. Unlike
the normal VMS port of @command{gawk}, no special command-line manipulation is
needed in the VMS POSIX environment.

@node Unsupported
@appendixsec Unsupported Operating System Ports

This @value{SECTION} describes systems for which
the @command{gawk} port is no longer supported.

@menu
* Atari Installation::          Installing @command{gawk} on the Atari ST.
* Tandem Installation::         Installing @command{gawk} on a Tandem.
@end menu

@node Atari Installation
@appendixsubsec Installing @command{gawk} on the Atari ST

The Atari port is no longer supported. It is
included for those who might want to use it, but it is no longer being
actively maintained.

@c based on material from Michal Jaegermann <michal@gortel.phys.ualberta.ca>
@cindex atari
@cindex installation, atari
There are no substantial differences when installing @command{gawk} on
various Atari models. Compiled @command{gawk} executables do not require
a large amount of memory with most @command{awk} programs, and should run on all
Motorola processor-based models (referred to below as ST, even if that is not
exactly right).

In order to use @command{gawk}, you need to have a shell, either text or
graphics, that does not map all the characters of a command line to
uppercase. Maintaining case distinction in option flags is very
important (@pxref{Options}).
These days this is the default and it may only be a problem for some
very old machines. If your system does not preserve the case of option
flags, you need to upgrade your tools.
Support for I/O
redirection is necessary to make it easy to import @command{awk} programs
from other environments. Pipes are nice to have but not vital.

@menu
* Atari Compiling::             Compiling @command{gawk} on Atari.
* Atari Using::                 Running @command{gawk} on Atari.
@end menu

@node Atari Compiling
@appendixsubsubsec Compiling @command{gawk} on the Atari ST

A proper compilation of @command{gawk} sources when @code{sizeof(int)}
differs from @code{sizeof(void *)} requires an ISO C compiler. An initial
port was done with @command{gcc}. You may actually prefer executables
where @code{int}s are four bytes wide, but the other variant works as well.

You may need quite a bit of memory when trying to recompile the @command{gawk}
sources, as some source files (@file{regex.c} in particular) are quite
big. If you run out of memory compiling such a file, try reducing the
optimization level for this particular file, which may help.

@cindex Linux
@cindex GNU/Linux
With a reasonable shell (@command{bash} will do), you have a pretty good chance
that the @command{configure} utility will succeed, particularly if
you run GNU/Linux, MiNT or a similar operating system. Otherwise,
sample versions of @file{config.h} and @file{Makefile.st} are given in the
@file{atari} subdirectory and can be edited and copied to the
corresponding files in the main source directory. Even if
@command{configure} produces something, it might be advisable to compare
its results with the sample versions and possibly make adjustments.

Some @command{gawk} source code fragments depend on a preprocessor define
@samp{atarist}. This basically assumes the TOS environment with @command{gcc}.
Modify these sections as appropriate if they are not right for your
environment.
Also see the remarks about @env{AWKPATH} and @code{envsep} in
@ref{Atari Using}.

As shipped, the sample @file{config.h} claims that the @code{system}
function is missing from the libraries, which is not true, and an
alternative implementation of this function is provided in
@file{unsupported/atari/system.c}.
Depending upon your particular combination of
shell and operating system, you might want to change the file to indicate
that @code{system} is available.

@node Atari Using
@appendixsubsubsec Running @command{gawk} on the Atari ST

An executable version of @command{gawk} should be placed, as usual,
anywhere in your @env{PATH} where your shell can find it.

While executing, the Atari version of @command{gawk} creates a number of temporary files. When
using @command{gcc} libraries for TOS, @command{gawk} looks for either of
the environment variables, @env{TEMP} or @env{TMPDIR}, in that order.
If either one is found, its value is assumed to be a directory for
temporary files. This directory must exist, and if you can spare the
memory, it is a good idea to put it on a RAM drive. If neither
@env{TEMP} nor @env{TMPDIR} is found, then @command{gawk} uses the
current directory for its temporary files.

The ST version of @command{gawk} searches for its program files, as described in
@ref{AWKPATH Variable}.
The default value for the @env{AWKPATH} variable is taken from
@code{DEFPATH} defined in @file{Makefile}. The sample @command{gcc}/TOS
@file{Makefile} for the ST in the distribution sets @code{DEFPATH} to
@code{@w{".,c:\lib\awk,c:\gnu\lib\awk"}}. The search path can be
modified by explicitly setting @env{AWKPATH} to whatever you want.
Note that colons cannot be used on the ST to separate elements in the
@env{AWKPATH} variable, since they have another reserved meaning.
Instead, you must use a comma to separate elements in the path. When
recompiling, the separating character can be modified by initializing
the @code{envsep} variable in @file{unsupported/atari/gawkmisc.atr} to another
value.

Although @command{awk} allows great flexibility in doing I/O redirections
from within a program, this facility should be used with care on the ST
running under TOS. In some circumstances, the OS routines for file-handle
pool processing lose track of certain events, causing the
computer to crash and requiring a reboot. Often a warm reboot is
sufficient. Fortunately, this happens infrequently and in rather
esoteric situations. In particular, avoid having one part of an
@command{awk} program using @code{print} statements explicitly redirected
to @file{/dev/stdout}, while other @code{print} statements use the
default standard output, and a calling shell has redirected standard
output to a file.
@c 10/2000: Is this still true, now that gawk does /dev/stdout internally?

When @command{gawk} is compiled with the ST version of @command{gcc} and its
usual libraries, it accepts both @samp{/} and @samp{\} as path separators.
While this is convenient, it should be remembered that this removes one
technically valid character (@samp{/}) from your @value{FN}.
It may also create problems for external programs called via the @code{system}
function, which may not support this convention. Whenever it is possible
that a file created by @command{gawk} will be used by some other program,
use only backslashes. Also remember that in @command{awk}, backslashes in
strings have to be doubled in order to get literal backslashes
(@pxref{Escape Sequences}).

@node Tandem Installation
@appendixsubsec Installing @command{gawk} on a Tandem
@cindex tandem
@cindex installation, tandem

The Tandem port is only minimally supported.
The port's contributor no longer has access to a Tandem system.

@c This section based on README.Tandem by Stephen Davies (scldad@sdc.com.au)
The Tandem port was done on a Cyclone machine running D20.
The port is pretty clean and all facilities seem to work except for
the I/O piping facilities
(@pxref{Getline/Pipe},
@ref{Getline/Variable/Pipe},
and
@ref{Redirection}),
which is just too foreign a concept for Tandem.

To build a Tandem executable from source, download all of the files so
that the @value{FN}s on the Tandem box conform to the restrictions of D20.
For example, @file{array.c} becomes @file{ARRAYC}, and @file{awk.h}
becomes @file{AWKH}. The totally Tandem-specific files are in the
@file{tandem} ``subvolume'' (@file{unsupported/tandem} in the @command{gawk}
distribution) and should be copied to the main source directory before
building @command{gawk}.

The file @file{compit} can then be used to compile and bind an executable.
Alas, there is no @command{configure} or @command{make}.

Usage is the same as for Unix, except that D20 requires all @samp{@{} and
@samp{@}} characters to be escaped with @samp{~} on the command line
(but @emph{not} in script files). Also, the standard Tandem syntax for
@samp{/in filename,out filename/} must be used instead of the usual
Unix @samp{<} and @samp{>} for file redirection. (Redirection options
on @code{getline}, @code{print} etc., are supported.)

The @samp{-mr @var{val}} option
(@pxref{Options})
has been ``stolen'' to enable Tandem users to process fixed-length
records with no ``end-of-line'' character.
That is, @samp{-mr 74} tells
@command{gawk} to read the input file as fixed 74-byte records.
@c ENDOFRANGE opgawx
@c ENDOFRANGE pcgawon

@node Bugs
@appendixsec Reporting Problems and Bugs
@cindex archeologists
@quotation
@i{There is nothing more dangerous than a bored archeologist.}@*
The Hitchhiker's Guide to the Galaxy
@end quotation
@c the radio show, not the book. :-)

@c STARTOFRANGE dbugg
@cindex debugging @command{gawk}, bug reports
@c STARTOFRANGE tblgawb
@cindex troubleshooting, @command{gawk}, bug reports
If you have problems with @command{gawk} or think that you have found a bug,
please report it to the developers; we cannot promise to do anything
but we might well want to fix it.

Before reporting a bug, make sure you have actually found a real bug.
Carefully reread the documentation and see if it really says you can do
what you're trying to do. If it's not clear whether you should be able
to do something or not, report that too; it's a bug in the documentation!

Before reporting a bug or trying to fix it yourself, try to isolate it
to the smallest possible @command{awk} program and input @value{DF} that
reproduces the problem. Then send us the program and @value{DF},
some idea of what kind of Unix system you're using,
the compiler you used to compile @command{gawk}, and the exact results
@command{gawk} gave you. Also say what you expected to occur; this helps
us decide whether the problem is really in the documentation.

@cindex @code{bug-gawk@@gnu.org} bug reporting address
@cindex email address for bug reports, @code{bug-gawk@@gnu.org}
@cindex bug reports, email address, @code{bug-gawk@@gnu.org}
Once you have a precise problem, send email to @email{bug-gawk@@gnu.org}.

@cindex Robbins, Arnold
Please include the version number of @command{gawk} you are using.
You can get this information with the command @samp{gawk --version}.
Using this address automatically sends a carbon copy of your
mail to me. If necessary, I can be reached directly at
@email{arnold@@gnu.org}. The bug reporting address is preferred since the
email list is archived at the GNU Project.
@emph{All email should be in English, since that is my native language.}

@cindex @code{comp.lang.awk} newsgroup
@strong{Caution:} Do @emph{not} try to report bugs in @command{gawk} by
posting to the Usenet/Internet newsgroup @code{comp.lang.awk}.
While the @command{gawk} developers do occasionally read this newsgroup,
there is no guarantee that we will see your posting. The steps described
above are the officially recognized ways for reporting bugs.

Non-bug suggestions are always welcome as well. If you have questions
about things that are unclear in the documentation or are just obscure
features, ask me; I will try to help you out, although I
may not have the time to fix the problem. You can send me electronic
mail at the Internet address noted previously.

If you find bugs in one of the non-Unix ports of @command{gawk}, please send
an electronic mail message to the person who maintains that port. They
are named in the following list, as well as in the @file{README} file in the @command{gawk}
distribution. Information in the @file{README} file should be considered
authoritative if it conflicts with this @value{DOCUMENT}.

The people maintaining the non-Unix ports of @command{gawk} are
as follows:

@ignore
@table @asis
@cindex Fish, Fred
@item Amiga
Fred Fish, @email{fnf@@ninemoons.com}.

@cindex Brown, Martin
@item BeOS
Martin Brown, @email{mc@@whoever.com}.

@cindex Deifik, Scott
@cindex Hankerson, Darrel
@item MS-DOS
Scott Deifik, @email{scottd@@amgen.com} and
Darrel Hankerson, @email{hankedr@@mail.auburn.edu}.

@cindex Grigera, Juan
@item MS-Windows
Juan Grigera, @email{juan@@biophnet.unlp.edu.ar}.

@item OS/2
The Unix for OS/2 team, @email{gawk-maintainer@@unixos2.org}.

@cindex Davies, Stephen
@item Tandem
Stephen Davies, @email{scldad@@sdc.com.au}.

@cindex Rankin, Pat
@item VMS
Pat Rankin, @email{rankin@@pactechdata.com}.
@end table
@end ignore

@multitable {MS-Windows} {123456789012345678901234567890123456789001234567890}
@cindex Fish, Fred
@item Amiga @tab Fred Fish, @email{fnf@@ninemoons.com}.

@cindex Brown, Martin
@item BeOS @tab Martin Brown, @email{mc@@whoever.com}.

@cindex Deifik, Scott
@cindex Hankerson, Darrel
@item MS-DOS @tab Scott Deifik, @email{scottd@@amgen.com} and
Darrel Hankerson, @email{hankedr@@mail.auburn.edu}.

@cindex Grigera, Juan
@item MS-Windows @tab Juan Grigera, @email{juan@@biophnet.unlp.edu.ar}.

@item OS/2 @tab The Unix for OS/2 team, @email{gawk-maintainer@@unixos2.org}.

@cindex Davies, Stephen
@item Tandem @tab Stephen Davies, @email{scldad@@sdc.com.au}.

@cindex Rankin, Pat
@item VMS @tab Pat Rankin, @email{rankin@@pactechdata.com}.
@end multitable

If your bug is also reproducible under Unix, please send a copy of your
report to the @email{bug-gawk@@gnu.org} email list as well.
@c ENDOFRANGE dbugg
@c ENDOFRANGE tblgawb

@node Other Versions
@appendixsec Other Freely Available @command{awk} Implementations
@c STARTOFRANGE awkim
@cindex @command{awk}, implementations
@ignore
From: emory!amc.com!brennan (Michael Brennan)
Subject: C++ comments in awk programs
To: arnold@gnu.ai.mit.edu (Arnold Robbins)
Date: Wed, 4 Sep 1996 08:11:48 -0700 (PDT)

@end ignore
@cindex Brennan, Michael
@quotation
@i{It's kind of fun to put comments like this in your awk code.}@*
@ @ @ @ @ @ @code{// Do C++ comments work? answer: yes! of course}@*
Michael Brennan
@end quotation

There are four other freely available @command{awk} implementations.
This @value{SECTION} briefly describes where to get them:

@table @asis
@cindex Kernighan, Brian
@cindex source code, Bell Laboratories @command{awk}
@item Unix @command{awk}
Brian Kernighan has made his implementation of
@command{awk} freely available.
You can retrieve this version via the World Wide Web from
his home page.@footnote{@uref{http://cm.bell-labs.com/who/bwk}}
It is available in several archive formats:

@table @asis
@item Shell archive
@uref{http://cm.bell-labs.com/who/bwk/awk.shar}

@item Compressed @command{tar} file
@uref{http://cm.bell-labs.com/who/bwk/awk.tar.gz}

@item Zip file
@uref{http://cm.bell-labs.com/who/bwk/awk.zip}
@end table

This version requires an ISO C (1990 standard) compiler;
the C compiler from
GCC (the GNU Compiler Collection)
works quite nicely.

@xref{BTL},
for a list of extensions in this @command{awk} that are not in POSIX @command{awk}.

@cindex Brennan, Michael
@cindex @command{mawk} program
@cindex source code, @command{mawk}
@item @command{mawk}
Michael Brennan has written an independent implementation of @command{awk},
called @command{mawk}. It is available under the GPL
(@pxref{Copying}),
just as @command{gawk} is.

You can get it via anonymous @command{ftp} to the host
@code{@w{ftp.whidbey.net}}. Change directory to @file{/pub/brennan}.
Use ``binary'' or ``image'' mode, and retrieve @file{mawk1.3.3.tar.gz}
(or the latest version that is there).

@command{gunzip} may be used to decompress this file. Installation
is similar to @command{gawk}'s
(@pxref{Unix Installation}).

@cindex extensions, @command{mawk}
@command{mawk} has the following extensions that are not in POSIX @command{awk}:

@itemize @bullet
@item
The @code{fflush} built-in function for flushing buffered output
(@pxref{I/O Functions}).

@item
The @samp{**} and @samp{**=} operators
(@pxref{Arithmetic Ops}
and also see
@ref{Assignment Ops}).

@item
The use of @code{func} as an abbreviation for @code{function}
(@pxref{Definition Syntax}).

@item
The @samp{\x} escape sequence
(@pxref{Escape Sequences}).

@item
The @file{/dev/stdout} and @file{/dev/stderr}
special files
(@pxref{Special Files}).
Use @code{"-"} instead of @code{"/dev/stdin"} with @command{mawk}.

@item
The ability for @code{FS} and for the third
argument to @code{split} to be null strings
(@pxref{Single Character Fields}).

@item
The ability to delete all of an array at once with @samp{delete @var{array}}
(@pxref{Delete}).

@item
The ability for @code{RS} to be a regexp
(@pxref{Records}).

@item
The @code{BINMODE} special variable for non-Unix operating systems
(@pxref{PC Using}).
@end itemize

The next version of @command{mawk} will support @code{nextfile}.

@cindex Sumner, Andrew
@cindex @command{awka} compiler for @command{awk}
@cindex source code, @command{awka}
@item @command{awka}
Written by Andrew Sumner,
@command{awka} translates @command{awk} programs into C, compiles them,
and links them with a library of functions that provides the core
@command{awk} functionality.
It also has a number of extensions.

The @command{awk} translator is released under the GPL, and the library
is under the LGPL.

To get @command{awka}, go to @uref{http://awka.sourceforge.net}.
You can reach Andrew Sumner at @email{andrew@@zbcom.net}.

@cindex Beebe, Nelson H.F.
@cindex @command{pawk} profiling Bell Labs @command{awk}
@item @command{pawk}
Nelson H.F.@: Beebe at the University of Utah has modified
the Bell Labs @command{awk} to provide timing and profiling information.
It is different from @command{pgawk}
(@pxref{Profiling}),
in that it uses CPU-based profiling, not line-count
profiling. You may find it at either
@uref{ftp://ftp.math.utah.edu/pub/pawk/pawk-20020210.tar.gz}
or
@uref{http://www.math.utah.edu/pub/pawk/pawk-20020210.tar.gz}.

@end table
@c ENDOFRANGE gligawk
@c ENDOFRANGE ingawk
@c ENDOFRANGE awkim

@node Notes
@appendix Implementation Notes
@c STARTOFRANGE gawii
@cindex @command{gawk}, implementation issues
@c STARTOFRANGE impis
@cindex implementation issues, @command{gawk}

This appendix contains information mainly of interest to implementors and
maintainers of @command{gawk}. Everything in it applies specifically to
@command{gawk} and not to other implementations.

@menu
* Compatibility Mode::          How to disable certain @command{gawk}
                                extensions.
* Additions::                   Making Additions To @command{gawk}.
* Dynamic Extensions::          Adding new built-in functions to
                                @command{gawk}.
* Future Extensions::           New features that may be implemented one day.
@end menu

@node Compatibility Mode
@appendixsec Downward Compatibility and Debugging
@cindex @command{gawk}, implementation issues, downward compatibility
@cindex @command{gawk}, implementation issues, debugging
@cindex troubleshooting, @command{gawk}
@c first comma is part of primary
@cindex implementation issues, @command{gawk}, debugging

@xref{POSIX/GNU},
for a summary of the GNU extensions to the @command{awk} language and program.
All of these features can be turned off by invoking @command{gawk} with the
@option{--traditional} option or with the @option{--posix} option.

If @command{gawk} is compiled for debugging with @samp{-DDEBUG}, then there
is one more option available on the command line:

@table @code
@item -W parsedebug
@itemx --parsedebug
Prints out the parse stack information as the program is being parsed.
@end table

This option is intended only for serious @command{gawk} developers
and not for the casual user. It probably has not even been compiled into
your version of @command{gawk}, since it slows down execution.

@node Additions
@appendixsec Making Additions to @command{gawk}

If you find that you want to enhance @command{gawk} in a significant
fashion, you are perfectly free to do so. That is the point of having
free software; the source code is available and you are free to change
it as you want (@pxref{Copying}).

This @value{SECTION} discusses the ways you might want to change @command{gawk}
as well as any considerations you should bear in mind.

@menu
* Adding Code::                 Adding code to the main body of
                                @command{gawk}.
* New Ports::                   Porting @command{gawk} to a new operating
                                system.
@end menu

@node Adding Code
@appendixsubsec Adding New Features

@c STARTOFRANGE adfgaw
@cindex adding, features to @command{gawk}
@c STARTOFRANGE fadgaw
@cindex features, adding to @command{gawk}
@c STARTOFRANGE gawadf
@cindex @command{gawk}, features, adding
You are free to add any new features you like to @command{gawk}.
However, if you want your changes to be incorporated into the @command{gawk}
distribution, there are several steps that you need to take in order to
make it possible for me to include your changes:

@enumerate 1
@item
Before building the new feature into @command{gawk} itself,
consider writing it as an extension module
(@pxref{Dynamic Extensions}).
If that's not possible, continue with the rest of the steps in this list.

@item
Get the latest version.
It is much easier for me to integrate changes if they are relative to
the most recent distributed version of @command{gawk}. If your version of
@command{gawk} is very old, I may not be able to integrate them at all.
(@xref{Getting},
for information on getting the latest version of @command{gawk}.)

@item
@ifnotinfo
Follow the @cite{GNU Coding Standards}.
@end ifnotinfo
@ifinfo
See @inforef{Top, , Version, standards, GNU Coding Standards}.
@end ifinfo
This document describes how GNU software should be written. If you haven't
read it, please do so, preferably @emph{before} starting to modify @command{gawk}.
(The @cite{GNU Coding Standards} are available from
the GNU Project's
@command{ftp}
site, at
@uref{ftp://ftp.gnu.org/gnu/GNUinfo/standards.text}.
An HTML version, suitable for reading with a WWW browser, is
available at
@uref{http://www.gnu.org/prep/standards_toc.html}.
Texinfo, Info, and DVI versions are also available.)

@cindex @command{gawk}, coding style in
@item
Use the @command{gawk} coding style.
The C code for @command{gawk} follows the instructions in the
@cite{GNU Coding Standards}, with minor exceptions. The code is formatted
using the traditional ``K&R'' style, particularly as regards the placement
of braces and the use of tabs. In brief, the coding rules for @command{gawk}
are as follows:

@itemize @bullet
@item
Use ANSI/ISO style (prototype) function headers when defining functions.

@item
Put the name of the function at the beginning of its own line.

@item
Put the return type of the function, even if it is @code{int}, on the
line above the line with the name and arguments of the function.

@item
Put spaces around parentheses used in control structures
(@code{if}, @code{while}, @code{for}, @code{do}, @code{switch},
and @code{return}).

@item
Do not put spaces in front of parentheses used in function calls.

@item
Put spaces around all C operators and after commas in function calls.

@item
Do not use the comma operator to produce multiple side effects, except
in @code{for} loop initialization and increment parts, and in macro bodies.

@item
Use real tabs for indenting, not spaces.

@item
Use the ``K&R'' brace layout style.

@item
Use comparisons against @code{NULL} and @code{'\0'} in the conditions of
@code{if}, @code{while}, and @code{for} statements, as well as in the @code{case}s
of @code{switch} statements, instead of just the
plain pointer or character value.

@item
Use the @code{TRUE}, @code{FALSE} and @code{NULL} symbolic constants
and the character constant @code{'\0'} where appropriate, instead of @code{1}
and @code{0}.

@item
Use the @code{ISALPHA}, @code{ISDIGIT}, etc.@: macros, instead of the
traditional lowercase versions; these macros are better behaved for
non-ASCII character sets.

@item
Provide one-line descriptive comments for each function.

@item
Do not use @samp{#elif}. Many older Unix C compilers cannot handle it.

@item
Do not use the @code{alloca} function for allocating memory off the stack.
Its use causes more portability trouble than is worth the minor benefit of not having
to free the storage. Instead, use @code{malloc} and @code{free}.
@end itemize

@strong{Note:}
If I have to reformat your code to follow the coding style used in
@command{gawk}, I may not bother to integrate your changes at all.

@item
Be prepared to sign the appropriate paperwork.
In order for the FSF to distribute your changes, you must either place
those changes in the public domain and submit a signed statement to that
effect, or assign the copyright in your changes to the FSF.
Both of these actions are easy to do and @emph{many} people have done so
already. If you have questions, please contact me
(@pxref{Bugs}),
or @email{gnu@@gnu.org}.

@cindex Texinfo
@item
Update the documentation.
Along with your new code, please supply new sections and/or chapters
for this @value{DOCUMENT}. If at all possible, please use real
Texinfo, instead of just supplying unformatted ASCII text (although
even that is better than no documentation at all).
Conventions to be followed in @cite{@value{TITLE}} are provided
after the @samp{@@bye} at the end of the Texinfo source file.
If possible, please update the @command{man} page as well.

You will also have to sign paperwork for your documentation changes.

@item
Submit changes as context diffs or unified diffs.
Use @samp{diff -c -r -N} or @samp{diff -u -r -N} to compare
the original @command{gawk} source tree with your version.
(I find context diffs to be more readable but unified diffs are
more compact.)
I recommend using the GNU version of @command{diff}.
Send the output produced by either run of @command{diff} to me when you
submit your changes.
(@xref{Bugs}, for the electronic mail
information.)

Using this format makes it easy for me to apply your changes to the
master version of the @command{gawk} source code (using @code{patch}).
If I have to apply the changes manually, using a text editor, I may
not do so, particularly if there are lots of changes.

@item
Include an entry for the @file{ChangeLog} file with your submission.
This helps further minimize the amount of work I have to do,
making it easier for me to accept patches.
@end enumerate

Although this sounds like a lot of work, please remember that while you
may write the new code, I have to maintain it and support it. If it
isn't possible for me to do that with a minimum of extra work, then I
probably will not.
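
As an illustration only, here is a minimal sketch of how the coding rules
listed earlier in this @value{SECTION} might look in practice. The two
functions, @code{all_digits} and @code{dup_string}, are hypothetical and are
not taken from the @command{gawk} sources. The definitions of @code{TRUE},
@code{FALSE}, and @code{ISDIGIT} are stand-ins, assumed here, for whatever
@command{gawk}'s own header files provide:

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Assumed stand-ins for definitions that gawk's own headers supply. */
#define TRUE	1
#define FALSE	0
#define ISDIGIT(c)	isdigit((unsigned char) (c))

/* all_digits: return TRUE if s is a nonempty string of decimal digits */

int
all_digits(const char *s)
{
	if (s == NULL || *s == '\0')
		return FALSE;

	for (; *s != '\0'; s++) {
		if (! ISDIGIT(*s))
			return FALSE;
	}

	return TRUE;
}

/* dup_string: copy s into malloc'ed storage; the caller must free it */

char *
dup_string(const char *s)
{
	char *copy;
	size_t len;

	if (s == NULL)
		return NULL;

	len = strlen(s);
	copy = (char *) malloc(len + 1);
	if (copy == NULL)
		return NULL;

	memcpy(copy, s, len + 1);
	return copy;
}
```

Points to notice are the return type on its own line, the function name
starting the following line, real tabs for indentation, the ``K&R'' brace
layout, explicit comparisons against @code{NULL} and @code{'\0'}, and the use
of @code{malloc} and @code{free} rather than @code{alloca}.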
@c ENDOFRANGE adfgaw
@c ENDOFRANGE gawadf
@c ENDOFRANGE fadgaw

@node New Ports
@appendixsubsec Porting @command{gawk} to a New Operating System
@cindex portability, @command{gawk}
@cindex operating systems, porting @command{gawk} to

@cindex porting @command{gawk}
If you want to port @command{gawk} to a new operating system, there are
several steps:

@enumerate 1
@item
Follow the guidelines in
@ifinfo
@ref{Adding Code},
@end ifinfo
@ifnotinfo
the previous @value{SECTION}
@end ifnotinfo
concerning coding style, submission of diffs, and so on.

@item
When doing a port, bear in mind that your code must coexist peacefully
with the rest of @command{gawk} and the other ports. Avoid gratuitous
changes to the system-independent parts of the code. If at all possible,
avoid sprinkling @samp{#ifdef}s just for your port throughout the
code.

If the changes needed for a particular system affect too much of the
code, I probably will not accept them. In such a case, you can, of course,
distribute your changes on your own, as long as you comply
with the GPL
(@pxref{Copying}).

@item
A number of the files that come with @command{gawk} are maintained by other
people at the Free Software Foundation. Thus, you should not change them
unless it is for a very good reason; i.e., changes are not out of the
question, but changes to these files are scrutinized extra carefully.
The files are @file{getopt.h}, @file{getopt.c},
@file{getopt1.c}, @file{regex.h}, @file{regex.c}, @file{dfa.h},
@file{dfa.c}, @file{install-sh}, and @file{mkinstalldirs}.

@item
Be willing to continue to maintain the port.
Non-Unix operating systems are supported by volunteers who maintain
the code needed to compile and run @command{gawk} on their systems.
If no one
volunteers to maintain a port, it becomes unsupported and it may
be necessary to remove it from the distribution.

@item
Supply an appropriate @file{gawkmisc.???} file.
Each port has its own @file{gawkmisc.???} that implements certain
operating system specific functions.  This is cleaner than a plethora of
@samp{#ifdef}s scattered throughout the code.  The @file{gawkmisc.c} in
the main source directory includes the appropriate
@file{gawkmisc.???} file from each subdirectory.
Be sure to update it as well.

Each port's @file{gawkmisc.???} file has a suffix reminiscent of the machine
or operating system for the port---for example, @file{pc/gawkmisc.pc} and
@file{vms/gawkmisc.vms}.  The use of separate suffixes, instead of plain
@file{gawkmisc.c}, makes it possible to move files from a port's subdirectory
into the main subdirectory, without accidentally destroying the real
@file{gawkmisc.c} file.  (Currently, this is only an issue for the
PC operating system ports.)

@item
Supply a @file{Makefile} as well as any other C source and header files that are
necessary for your operating system.  All your code should be in a
separate subdirectory, with a name that is the same as, or reminiscent
of, either your operating system or the computer system.  If possible,
try to structure things so that it is not necessary to move files out
of the subdirectory into the main source directory.  If that is not
possible, then be sure to avoid using names for your files that
duplicate the names of files in the main source directory.

@item
Update the documentation.
Please write a section (or sections) for this @value{DOCUMENT} describing the
installation and compilation steps needed to compile and/or install
@command{gawk} for your system.

@item
Be prepared to sign the appropriate paperwork.
In order for the FSF to distribute your code, you must either place
your code in the public domain and submit a signed statement to that
effect, or assign the copyright in your code to the FSF.
@ifinfo
Both of these actions are easy to do and @emph{many} people have done so
already.  If you have questions, please contact me, or
@email{gnu@@gnu.org}.
@end ifinfo
@end enumerate

Following these steps makes it much easier to integrate your changes
into @command{gawk} and have them coexist happily with other
operating systems' code that is already there.

In the code that you supply and maintain, feel free to use a
coding style and brace layout that suits your taste.

@node Dynamic Extensions
@appendixsec Adding New Built-in Functions to @command{gawk}
@cindex Robinson, Will
@cindex robot, the
@cindex Lost In Space
@quotation
@i{Danger Will Robinson!  Danger!!@*
Warning! Warning!}@*
The Robot
@end quotation

@c STARTOFRANGE gladfgaw
@cindex @command{gawk}, functions, adding
@c STARTOFRANGE adfugaw
@cindex adding, functions to @command{gawk}
@c STARTOFRANGE fubadgaw
@cindex functions, built-in, adding to @command{gawk}
Beginning with @command{gawk} 3.1, it is possible to add new built-in
functions to @command{gawk} using dynamically loaded libraries.  This
facility is available on systems (such as GNU/Linux) that support
the @code{dlopen} and @code{dlsym} functions.
This @value{SECTION} describes how to write and use dynamically
loaded extensions for @command{gawk}.
Experience with programming in
C or C++ is necessary when reading this @value{SECTION}.

@strong{Caution:} The facilities described in this @value{SECTION}
are very much subject to change in the next @command{gawk} release.
Be aware that you may have to re-do everything, perhaps from scratch,
upon the next release.

@menu
* Internals::                   A brief look at some @command{gawk} internals.
* Sample Library::              An example of new functions.
@end menu

@node Internals
@appendixsubsec A Minimal Introduction to @command{gawk} Internals
@c STARTOFRANGE gawint
@cindex @command{gawk}, internals

The truth is that @command{gawk} was not designed for simple extensibility.
The facilities for adding functions using shared libraries work, but
are something of a ``bag on the side.''  Thus, this tour is
brief and simplistic; would-be @command{gawk} hackers are encouraged to
spend some time reading the source code before trying to write
extensions based on the material presented here.  Of particular note
are the files @file{awk.h}, @file{builtin.c}, and @file{eval.c}.
Reading @file{awk.y} in order to see how the parse tree is built
would also be of use.

@cindex @code{awk.h} file (internal)
With the disclaimers out of the way, the following types, structure
members, functions, and macros are declared in @file{awk.h} and are of
use when writing extensions.  The next @value{SECTION}
shows how they are used:

@table @code
@cindex floating-point, numbers, @code{AWKNUM} internal type
@cindex numbers, floating-point, @code{AWKNUM} internal type
@cindex @code{AWKNUM} internal type
@item AWKNUM
An @code{AWKNUM} is the internal type of @command{awk}
floating-point numbers.  Typically, it is a C @code{double}.

@cindex @code{NODE} internal type
@cindex strings, @code{NODE} internal type
@cindex numbers, @code{NODE} internal type
@item NODE
Just about everything is done using objects of type @code{NODE}.
These contain both strings and numbers, as well as variables and arrays.

@cindex @code{force_number} internal function
@cindex numeric, values
@item AWKNUM force_number(NODE *n)
This macro forces a value to be numeric.  It returns the actual
numeric value contained in the node.
It may end up calling an internal @command{gawk} function.

@cindex @code{force_string} internal function
@item void force_string(NODE *n)
This macro guarantees that a @code{NODE}'s string value is current.
It may end up calling an internal @command{gawk} function.
It also guarantees that the string is zero-terminated.

@c comma is part of primary
@cindex parameters, number of
@cindex @code{param_cnt} internal variable
@item n->param_cnt
The number of parameters actually passed in a function call at runtime.

@cindex @code{stptr} internal variable
@cindex @code{stlen} internal variable
@item n->stptr
@itemx n->stlen
The data and length of a @code{NODE}'s string value, respectively.
The string is @emph{not} guaranteed to be zero-terminated.
If you need to pass the string value to a C library function, save
the value in @code{n->stptr[n->stlen]}, assign @code{'\0'} to it,
call the routine, and then restore the value.

@cindex @code{type} internal variable
@item n->type
The type of the @code{NODE}.  This is a C @code{enum}.  Values should
be either @code{Node_var} or @code{Node_var_array} for function
parameters.

@cindex @code{vname} internal variable
@item n->vname
The ``variable name'' of a node.  This is not of much use inside
externally written extensions.

@cindex arrays, associative, clearing
@cindex @code{assoc_clear} internal function
@item void assoc_clear(NODE *n)
Clears the associative array pointed to by @code{n}.
Make sure that @samp{n->type == Node_var_array} first.

@cindex arrays, elements, installing
@cindex @code{assoc_lookup} internal function
@item NODE **assoc_lookup(NODE *symbol, NODE *subs, int reference)
Finds, and installs if necessary, array elements.
@code{symbol} is the array, @code{subs} is the subscript.
This is usually a value created with @code{tmp_string} (see below).
@code{reference} should be @code{TRUE} if it is an error to use the
value before it is created.  Typically, @code{FALSE} is the
correct value to use from extension functions.

@cindex strings
@cindex @code{make_string} internal function
@item NODE *make_string(char *s, size_t len)
Take a C string and turn it into a pointer to a @code{NODE} that
can be stored appropriately.  This is permanent storage; understanding
of @command{gawk} memory management is helpful.

@cindex numbers
@cindex @code{make_number} internal function
@item NODE *make_number(AWKNUM val)
Take an @code{AWKNUM} and turn it into a pointer to a @code{NODE} that
can be stored appropriately.  This is permanent storage; understanding
of @command{gawk} memory management is helpful.

@cindex @code{tmp_string} internal function
@item NODE *tmp_string(char *s, size_t len);
Take a C string and turn it into a pointer to a @code{NODE} that
can be stored appropriately.  This is temporary storage; understanding
of @command{gawk} memory management is helpful.

@cindex @code{tmp_number} internal function
@item NODE *tmp_number(AWKNUM val)
Take an @code{AWKNUM} and turn it into a pointer to a @code{NODE} that
can be stored appropriately.  This is temporary storage;
understanding of @command{gawk} memory management is helpful.

@c comma is part of primary
@cindex nodes, duplicating
@cindex @code{dupnode} internal function
@item NODE *dupnode(NODE *n)
Duplicate a node.
In most cases, this increments an internal
reference count instead of actually duplicating the entire @code{NODE};
understanding of @command{gawk} memory management is helpful.

@cindex memory, releasing
@cindex @code{free_temp} internal macro
@item void free_temp(NODE *n)
This macro releases the memory associated with a @code{NODE}
allocated with @code{tmp_string} or @code{tmp_number}.
Understanding of @command{gawk} memory management is helpful.

@cindex @code{make_builtin} internal function
@item void make_builtin(char *name, NODE *(*func)(NODE *), int count)
Register a C function pointed to by @code{func} as new built-in
function @code{name}.  @code{name} is a regular C string.  @code{count}
is the maximum number of arguments that the function takes.
The function should be written in the following manner:

@example
/* do_xxx --- do xxx function for gawk */

NODE *
do_xxx(NODE *tree)
@{
    @dots{}
@}
@end example

@cindex arguments, retrieving
@cindex @code{get_argument} internal function
@item NODE *get_argument(NODE *tree, int i)
This function is called from within a C extension function to get
the @code{i}-th argument from the function call.
The first argument is argument zero.

@c last comma is part of secondary
@cindex functions, return values, setting
@cindex @code{set_value} internal function
@item void set_value(NODE *tree)
This function is called from within a C extension function to set
the return value from the extension function.  This value is
what the @command{awk} program sees as the return value from the
new @command{awk} function.

@cindex @code{ERRNO} variable
@cindex @code{update_ERRNO} internal function
@item void update_ERRNO(void)
This function is called from within a C extension function to set
the value of @command{gawk}'s @code{ERRNO} variable, based on the current
value of the C @code{errno} variable.
It is provided as a convenience.
@end table

An argument that is supposed to be an array needs to be handled with
some extra code, in case the array being passed in is actually
from a function parameter.

In versions of @command{gawk} up to and including 3.1.2, the
following boilerplate code shows how to do this:

@smallexample
NODE *the_arg;

the_arg = get_argument(tree, 2); /* assume need 3rd arg, 0-based */

/* if a parameter, get it off the stack */
if (the_arg->type == Node_param_list)
    the_arg = stack_ptr[the_arg->param_cnt];

/* parameter referenced an array, get it */
if (the_arg->type == Node_array_ref)
    the_arg = the_arg->orig_array;

/* check type */
if (the_arg->type != Node_var && the_arg->type != Node_var_array)
    fatal("newfunc: third argument is not an array");

/* force it to be an array, if necessary, clear it */
the_arg->type = Node_var_array;
assoc_clear(the_arg);
@end smallexample

For versions 3.1.3 and later, the internals changed.  In particular,
the interface was actually @emph{simplified} drastically.  The
following boilerplate code now suffices:

@smallexample
NODE *the_arg;

the_arg = get_argument(tree, 2); /* assume need 3rd arg, 0-based */

/* force it to be an array: */
the_arg = get_array(the_arg);

/* if necessary, clear it: */
assoc_clear(the_arg);
@end smallexample

Again, you should spend time studying the @command{gawk} internals;
don't just blindly copy this code.
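
The save-and-restore advice given above for @code{n->stptr} and
@code{n->stlen} can be tried in isolation.  The following is only a
sketch: it does not include @file{awk.h}, and both
@code{struct fake_node} and @code{with_c_string} are hypothetical names
invented for this illustration.  It also assumes that the byte at
@code{stptr[stlen]} may be read and written, which the advice implies
but which you should confirm in the @command{gawk} source:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two NODE members discussed above;
   this is NOT gawk's real NODE type. */
struct fake_node {
    char *stptr;    /* string data, not necessarily zero-terminated */
    size_t stlen;   /* number of valid bytes in stptr */
};

/* with_c_string --- call fn on the node's string as a C string.
   Assumes that stptr[stlen] is addressable (an assumption made
   for this sketch). */
size_t
with_c_string(struct fake_node *n, size_t (*fn)(const char *))
{
    char save;
    size_t result;

    assert(n != NULL && n->stptr != NULL && fn != NULL);

    save = n->stptr[n->stlen];      /* save the byte past the end */
    n->stptr[n->stlen] = '\0';      /* terminate temporarily */
    result = fn(n->stptr);          /* call the C library routine */
    n->stptr[n->stlen] = save;      /* restore the saved byte */

    return result;
}
```

For instance, with a buffer holding @code{"hello, world"} and
@code{stlen} set to 5, passing @code{strlen} as @code{fn} sees only
@code{"hello"}, and the comma that follows it is back in place once
the call returns.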
@c ENDOFRANGE gawint

@node Sample Library
@appendixsubsec Directory and File Operation Built-ins
@c comma is part of primary
@c STARTOFRANGE chdirg
@cindex @code{chdir} function, implementing in @command{gawk}
@c comma is part of primary
@c STARTOFRANGE statg
@cindex @code{stat} function, implementing in @command{gawk}
@c last comma is part of secondary
@c STARTOFRANGE filre
@cindex files, information about, retrieving
@c STARTOFRANGE dirch
@cindex directories, changing

Two useful functions that are not in @command{awk} are @code{chdir}
(so that an @command{awk} program can change its directory) and
@code{stat} (so that an @command{awk} program can gather information about
a file).
This @value{SECTION} implements these functions for @command{gawk} in an
external extension library.

@menu
* Internal File Description::   What the new functions will do.
* Internal File Ops::           The code for internal file operations.
* Using Internal File Ops::     How to use an external extension.
@end menu

@node Internal File Description
@appendixsubsubsec Using @code{chdir} and @code{stat}

This @value{SECTION} shows how to use the new functions at the @command{awk}
level once they've been integrated into the running @command{gawk}
interpreter.
Using @code{chdir} is very straightforward.  It takes one argument,
the new directory to change to:

@example
@dots{}
newdir = "/home/arnold/funstuff"
ret = chdir(newdir)
if (ret < 0) @{
    printf("could not change to %s: %s\n",
                   newdir, ERRNO) > "/dev/stderr"
    exit 1
@}
@dots{}
@end example

The return value is negative if the @code{chdir} failed,
and @code{ERRNO}
(@pxref{Built-in Variables})
is set to a string indicating the error.
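
The convention just described (a negative return value together with
an error string) mirrors what the C library itself offers.  The
following stand-alone sketch shows the same pattern in plain C, without
any @command{gawk} internals.  The name @code{try_chdir} is
hypothetical, and using @code{strerror} to obtain the message is an
assumption made for this illustration, not a statement about how
@code{ERRNO} is actually filled in:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* try_chdir --- hypothetical helper, not part of gawk.
   Returns what chdir() returns; on failure, *msg points to a
   description of the error, otherwise *msg is set to NULL. */
int
try_chdir(const char *dir, const char **msg)
{
    int ret;

    assert(dir != NULL && msg != NULL);

    *msg = NULL;
    ret = chdir(dir);
    if (ret < 0)
        *msg = strerror(errno);   /* text for the current errno value */

    return ret;
}
```

A caller tests the return value first and looks at the message only
when that value is negative, just as the @command{awk} example above
consults @code{ERRNO} only after @code{chdir} has reported a failure.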

Using @code{stat} is a bit more complicated.
The C @code{stat} function fills in a structure that has a fair
amount of information.
The right way to model this in @command{awk} is to fill in an associative
array with the appropriate information:

@c broke printf for page breaking
@example
file = "/home/arnold/.profile"
fdata[1] = "x"    # force `fdata' to be an array
ret = stat(file, fdata)
if (ret < 0) @{
    printf("could not stat %s: %s\n",
             file, ERRNO) > "/dev/stderr"
    exit 1
@}
printf("size of %s is %d bytes\n", file, fdata["size"])
@end example

The @code{stat} function always clears the data array, even if
the @code{stat} fails.  It fills in the following elements:

@table @code
@item "name"
The name of the file that was @code{stat}'ed.

@item "dev"
@itemx "ino"
The file's device and inode numbers, respectively.

@item "mode"
The file's mode, as a numeric value.  This includes both the file's
type and its permissions.

@item "nlink"
The number of hard links (directory entries) the file has.

@item "uid"
@itemx "gid"
The numeric user and group ID numbers of the file's owner.

@item "size"
The size in bytes of the file.

@item "blocks"
The number of disk blocks the file actually occupies.  This may not
be a function of the file's size if the file has holes.

@item "atime"
@itemx "mtime"
@itemx "ctime"
The file's last access, modification, and inode update times,
respectively.  These are numeric timestamps, suitable for formatting
with @code{strftime}
(@pxref{Built-in}).

@item "pmode"
The file's ``printable mode.''  This is a string representation of
the file's type and permissions, such as what is produced by
@samp{ls -l}---for example, @code{"drwxr-xr-x"}.

@item "type"
A printable string representation of the file's type.  The value
is one of the following:

@table @code
@item "blockdev"
@itemx "chardev"
The file is a block or character device (``special file'').

@ignore
@item "door"
The file is a Solaris ``door'' (special file used for
interprocess communications).
@end ignore

@item "directory"
The file is a directory.

@item "fifo"
The file is a named pipe (also known as a FIFO).

@item "file"
The file is just a regular file.

@item "socket"
The file is an @code{AF_UNIX} (``Unix domain'') socket in the
filesystem.

@item "symlink"
The file is a symbolic link.
@end table
@end table

Several additional elements may be present depending upon the operating
system and the type of the file.  You can test for them in your @command{awk}
program by using the @code{in} operator
(@pxref{Reference to Elements}):

@table @code
@item "blksize"
The preferred block size for I/O to the file.  This field is not
present on all POSIX-like systems in the C @code{stat} structure.

@item "linkval"
If the file is a symbolic link, this element is the name of the
file the link points to (i.e., the value of the link).

@item "rdev"
@itemx "major"
@itemx "minor"
If the file is a block or character device file, then these values
represent the numeric device number and the major and minor components
of that number, respectively.
@end table

@node Internal File Ops
@appendixsubsubsec C Code for @code{chdir} and @code{stat}

Here is the C code for these extensions.  They were written for
GNU/Linux.
The code needs some more work for complete portability
to other POSIX-compliant systems:@footnote{This version is edited
slightly for presentation.  The complete version can be found in
@file{extension/filefuncs.c} in the @command{gawk} distribution.}

@c break line for page breaking
@example
#include "awk.h"

#include <sys/sysmacros.h>

/*  do_chdir --- provide dynamically loaded
                 chdir() builtin for gawk */

static NODE *
do_chdir(tree)
NODE *tree;
@{
    NODE *newdir;
    int ret = -1;

    newdir = get_argument(tree, 0);
@end example

The file includes the @code{"awk.h"} header file for definitions
for the @command{gawk} internals.  It includes @code{<sys/sysmacros.h>}
for access to the @code{major} and @code{minor} macros.

@cindex programming conventions, @command{gawk} internals
By convention, for an @command{awk} function @code{foo}, the function that
implements it is called @samp{do_foo}.  The function should take
a @samp{NODE *} argument, usually called @code{tree}, that
represents the argument list to the function.  The @code{newdir}
variable represents the new directory to change to, retrieved
with @code{get_argument}.  Note that the first argument is
numbered zero.

This code actually accomplishes the @code{chdir}.  It first forces
the argument to be a string and passes the string value to the
@code{chdir} system call.  If the @code{chdir} fails, @code{ERRNO}
is updated.
The result of @code{force_string} has to be freed with @code{free_temp}:

@example
    if (newdir != NULL) @{
        (void) force_string(newdir);
        ret = chdir(newdir->stptr);
        if (ret < 0)
            update_ERRNO();

        free_temp(newdir);
    @}
@end example

Finally, the function returns the return value to the @command{awk} level,
using @code{set_value}.
Then it must return a value from the call to
the new built-in (this value is ignored by the interpreter):

@example
    /* Set the return value */
    set_value(tmp_number((AWKNUM) ret));

    /* Just to make the interpreter happy */
    return tmp_number((AWKNUM) 0);
@}
@end example

The @code{stat} built-in is more involved.  First comes a function
that turns a numeric mode into a printable representation
(e.g., 644 becomes @samp{-rw-r--r--}).  This is omitted here for brevity:

@c break line for page breaking
@example
/* format_mode --- turn a stat mode field
                   into something readable */

static char *
format_mode(fmode)
unsigned long fmode;
@{
    @dots{}
@}
@end example

Next comes the actual @code{do_stat} function itself.  First come the
variable declarations and argument checking:

@ignore
Changed message for page breaking. Used to be:
    "stat: called with incorrect number of arguments (%d), should be 2",
@end ignore
@example
/* do_stat --- provide a stat() function for gawk */

static NODE *
do_stat(tree)
NODE *tree;
@{
    NODE *file, *array;
    struct stat sbuf;
    int ret;
    char *msg;
    NODE **aptr;
    char *pmode;    /* printable mode */
    char *type = "unknown";

    /* check arg count */
    if (tree->param_cnt != 2)
        fatal(
    "stat: called with %d arguments, should be 2",
            tree->param_cnt);
@end example

Then comes the actual work.  First, we get the arguments.
Then, we always clear the array.  To get the file information,
we use @code{lstat}, in case the file is a symbolic link.
If there's an error, we set @code{ERRNO} and return:

@c comment made multiline for page breaking
@example
    /*
     * file is first arg,
     * array to hold results is second
     */
    file = get_argument(tree, 0);
    array = get_argument(tree, 1);

    /* empty out the array */
    assoc_clear(array);

    /* lstat the file, if error, set ERRNO and return */
    (void) force_string(file);
    ret = lstat(file->stptr, & sbuf);
    if (ret < 0) @{
        update_ERRNO();

        set_value(tmp_number((AWKNUM) ret));

        free_temp(file);
        return tmp_number((AWKNUM) 0);
    @}
@end example

Now comes the tedious part: filling in the array.  Only a few of the
calls are shown here, since they all follow the same pattern:

@example
    /* fill in the array */
    aptr = assoc_lookup(array, tmp_string("name", 4), FALSE);
    *aptr = dupnode(file);

    aptr = assoc_lookup(array, tmp_string("mode", 4), FALSE);
    *aptr = make_number((AWKNUM) sbuf.st_mode);

    aptr = assoc_lookup(array, tmp_string("pmode", 5), FALSE);
    pmode = format_mode(sbuf.st_mode);
    *aptr = make_string(pmode, strlen(pmode));
@end example

When done, we free the temporary value containing the @value{FN},
set the return value, and return:

@example
    free_temp(file);

    /* Set the return value */
    set_value(tmp_number((AWKNUM) ret));

    /* Just to make the interpreter happy */
    return tmp_number((AWKNUM) 0);
@}
@end example

@cindex programming conventions, @command{gawk} internals
Finally, it's necessary to provide the ``glue'' that loads the
new function(s) into @command{gawk}.
By convention, each library has
a routine named @code{dlload} that does the job:

@example
/* dlload --- load new builtins in this library */

NODE *
dlload(tree, dl)
NODE *tree;
void *dl;
@{
    make_builtin("chdir", do_chdir, 1);
    make_builtin("stat", do_stat, 2);
    return tmp_number((AWKNUM) 0);
@}
@end example

And that's it!  As an exercise, consider adding functions to
implement system calls such as @code{chown}, @code{chmod}, and @code{umask}.

@node Using Internal File Ops
@appendixsubsubsec Integrating the Extensions

@c last comma is part of secondary
@cindex @command{gawk}, interpreter, adding code to
Now that the code is written, it must be possible to add it at
runtime to the running @command{gawk} interpreter.  First, the
code must be compiled.  Assuming that the functions are in
a file named @file{filefuncs.c}, and @var{idir} is the location
of the @command{gawk} include files,
the following steps create
a GNU/Linux shared library:

@example
$ gcc -shared -DHAVE_CONFIG_H -c -O -g -I@var{idir} filefuncs.c
$ ld -o filefuncs.so -shared filefuncs.o
@end example

@cindex @code{extension} function (@command{gawk})
Once the library exists, it is loaded by calling the @code{extension}
built-in function.
This function takes two arguments: the name of the
library to load and the name of a function to call when the library
is first loaded.  This function adds the new functions to @command{gawk}.
It returns the value returned by the initialization function
within the shared library:

@example
# file testff.awk
BEGIN @{
    extension("./filefuncs.so", "dlload")

    chdir(".")  # no-op

    data[1] = 1 # force `data' to be an array
    print "Info for testff.awk"
    ret = stat("testff.awk", data)
    print "ret =", ret
    for (i in data)
        printf "data[\"%s\"] = %s\n", i, data[i]
    print "testff.awk modified:",
        strftime("%m %d %y %H:%M:%S", data["mtime"])
@}
@end example

Here are the results of running the program:

@example
$ gawk -f testff.awk
@print{} Info for testff.awk
@print{} ret = 0
@print{} data["blksize"] = 4096
@print{} data["mtime"] = 932361936
@print{} data["mode"] = 33188
@print{} data["type"] = file
@print{} data["dev"] = 2065
@print{} data["gid"] = 10
@print{} data["ino"] = 878597
@print{} data["ctime"] = 971431797
@print{} data["blocks"] = 2
@print{} data["nlink"] = 1
@print{} data["name"] = testff.awk
@print{} data["atime"] = 971608519
@print{} data["pmode"] = -rw-r--r--
@print{} data["size"] = 607
@print{} data["uid"] = 2076
@print{} testff.awk modified: 07 19 99 08:25:36
@end example
@c ENDOFRANGE filre
@c ENDOFRANGE dirch
@c ENDOFRANGE statg
@c ENDOFRANGE chdirg
@c ENDOFRANGE gladfgaw
@c ENDOFRANGE adfugaw
@c ENDOFRANGE fubadgaw

@node Future Extensions
@appendixsec Probable Future Extensions
@ignore
From emory!scalpel.netlabs.com!lwall Tue Oct 31 12:43:17 1995
Return-Path: <emory!scalpel.netlabs.com!lwall>
Message-Id: <9510311732.AA28472@scalpel.netlabs.com>
To: arnold@skeeve.atl.ga.us (Arnold D. Robbins)
Subject: Re: May I quote you?
In-Reply-To: Your message of "Tue, 31 Oct 95 09:11:00 EST."
             <m0tAHPQ-00014MC@skeeve.atl.ga.us>
Date: Tue, 31 Oct 95 09:32:46 -0800
From: Larry Wall <emory!scalpel.netlabs.com!lwall>

: Greetings. I am working on the release of gawk 3.0. Part of it will be a
: thoroughly updated manual. One of the sections deals with planned future
: extensions and enhancements. I have the following at the beginning
: of it:
:
: @cindex PERL
: @cindex Wall, Larry
: @display
: @i{AWK is a language similar to PERL, only considerably more elegant.} @*
: Arnold Robbins
: @sp 1
: @i{Hey!} @*
: Larry Wall
: @end display
:
: Before I actually release this for publication, I wanted to get your
: permission to quote you. (Hopefully, in the spirit of much of GNU, the
: implied humor is visible... :-)

I think that would be fine.

Larry
@end ignore
@cindex PERL
@cindex Wall, Larry
@cindex Robbins, Arnold
@quotation
@i{AWK is a language similar to PERL, only considerably more elegant.}@*
Arnold Robbins

@i{Hey!}@*
Larry Wall
@end quotation

This @value{SECTION} briefly lists extensions and possible improvements
that indicate the directions we are
currently considering for @command{gawk}.  The file @file{FUTURES} in the
@command{gawk} distribution lists these extensions as well.

Following is a list of probable future changes visible at the
@command{awk} language level:

@c these are ordered by likelihood
@table @asis
@item Loadable module interface
It is not clear that the @command{awk}-level interface to the
modules facility is as good as it should be.  The interface needs to be
redesigned, particularly taking namespace issues into account, as
well as possibly including issues such as library search path order
and versioning.

@item @code{RECLEN} variable for fixed-length records
Along with @code{FIELDWIDTHS}, this would speed up the processing of
fixed-length records.
@code{PROCINFO["RS"]} would be @code{"RS"} or @code{"RECLEN"},
depending upon which kind of record processing is in effect.

@item Additional @code{printf} specifiers
The 1999 ISO C standard added a number of additional @code{printf}
format specifiers.  These should be evaluated for possible inclusion
in @command{gawk}.

@ignore
@item A @samp{%'d} flag
Add @samp{%'d} for putting in commas in formatting numeric values.
@end ignore

@item Databases
It may be possible to map a GDBM/NDBM/SDBM file into an @command{awk} array.

@item Large character sets
It would be nice if @command{gawk} could handle UTF-8 and other
character sets that are larger than eight bits.

@item More @code{lint} warnings
There are more things that could be checked for portability.
@end table

Following is a list of probable improvements that will make @command{gawk}'s
source code easier to work with:

@table @asis
@item Loadable module mechanics
The current extension mechanism works
(@pxref{Dynamic Extensions}),
but is rather primitive.  It requires a fair amount of manual work
to create and integrate a loadable module.
Nor is the current mechanism as portable as might be desired.
The GNU @command{libtool} package provides a number of features that
would make using loadable modules much easier.
@command{gawk} should be changed to use @command{libtool}.

@item Loadable module internals
The API to its internals that @command{gawk} ``exports'' should be revised.
Too many things are needlessly exposed.  A new API should be designed
and implemented to make module writing easier.

@item Better array subscript management
@command{gawk}'s management of array subscript storage could use revamping,
so that using the same value to index multiple arrays only
stores one copy of the index value.

@item Integrating the DBUG library
Integrating Fred Fish's DBUG library would be helpful during development,
but it's a lot of work to do.
@end table

Following is a list of probable improvements that will make @command{gawk}
perform better:

@table @asis
@c NEXT ED: remove this item. awka and mawk do these respectively
@item Compilation of @command{awk} programs
@command{gawk} uses a Bison (YACC-like)
parser to convert the script given it into a syntax tree; the syntax
tree is then executed by a simple recursive evaluator.  This method incurs
a lot of overhead, since the recursive evaluator performs many procedure
calls to do even the simplest things.

It should be possible for @command{gawk} to convert the script's parse tree
into a C program which the user would then compile, using the normal
C compiler and a special @command{gawk} library to provide all the needed
functions (regexps, fields, associative arrays, type coercion, and so on).

@c last comma is part of secondary
@cindex @command{gawk}, interpreter, adding code to
An easier possibility might be for an intermediate phase of @command{gawk} to
convert the parse tree into a linear byte code form like the one used
in GNU Emacs Lisp.  The recursive evaluator would then be replaced by
a straight line byte code interpreter that would be intermediate in speed
between running a compiled program and doing what @command{gawk} does
now.
@end table

Finally,
the programs in the test suite could use documenting in this @value{DOCUMENT}.

@xref{Additions},
if you are interested in tackling any of these projects.
@c ENDOFRANGE impis
@c ENDOFRANGE gawii

@node Basic Concepts
@appendix Basic Programming Concepts
@cindex programming, concepts
@c STARTOFRANGE procon
@cindex programming, concepts

This @value{APPENDIX} attempts to define some of the basic concepts
and terms that are used throughout the rest of this @value{DOCUMENT}.
As this @value{DOCUMENT} is specifically about @command{awk},
and not about computer programming in general, the coverage here
is by necessity fairly cursory and simplistic.
(If you need more background, there are many
other introductory texts that you should refer to instead.)

@menu
* Basic High Level::            The high level view.
* Basic Data Typing::           A very quick intro to data types.
* Floating Point Issues::       Stuff to know about floating-point numbers.
@end menu

@node Basic High Level
@appendixsec What a Program Does

@cindex processing data
At the most basic level, the job of a program is to process
some input data and produce results.

@c NEXT ED: Use real images here
@iftex
@tex
\expandafter\ifx\csname graph\endcsname\relax \csname newbox\endcsname\graph\fi
\expandafter\ifx\csname graphtemp\endcsname\relax \csname newdimen\endcsname\graphtemp\fi
\setbox\graph=\vtop{\vskip 0pt\hbox{%
    \special{pn 20}%
    \special{pa 2425 200}%
    \special{pa 2850 200}%
    \special{fp}%
    \special{sh 1.000}%
    \special{pn 20}%
    \special{pa 2750 175}%
    \special{pa 2850 200}%
    \special{pa 2750 225}%
    \special{pa 2750 175}%
    \special{fp}%
    \special{pn 20}%
    \special{pa 850 200}%
    \special{pa 1250 200}%
    \special{fp}%
    \special{sh 1.000}%
    \special{pn 20}%
    \special{pa 1150 175}%
    \special{pa 1250 200}%
    \special{pa 1150 225}%
    \special{pa 1150 175}%
    \special{fp}%
    \special{pn 20}%
    \special{pa 2950 400}%
    \special{pa 3650 400}%
    \special{pa 3650 0}%
    \special{pa 2950 0}%
    \special{pa 2950 400}%
    \special{fp}%
    \special{pn 10}%
    \special{ar 1800 200 450 200 0 6.28319}%
    \graphtemp=.5ex\advance\graphtemp by 0.200in
    \rlap{\kern 3.300in\lower\graphtemp\hbox to 0pt{\hss Results\hss}}%
    \graphtemp=.5ex\advance\graphtemp by 0.200in
    \rlap{\kern 1.800in\lower\graphtemp\hbox to 0pt{\hss Program\hss}}%
    \special{pn 10}%
    \special{pa 0 400}%
    \special{pa 700 400}%
    \special{pa 700 0}%
    \special{pa 0 0}%
    \special{pa 0 400}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 0.200in
    \rlap{\kern 0.350in\lower\graphtemp\hbox to 0pt{\hss Data\hss}}%
    \hbox{\vrule depth0.400in width0pt height 0pt}%
    \kern 3.650in
  }%
}%
\centerline{\box\graph}
@end tex
@end iftex
@ifnottex
@example
                  _______
+------+         /       \         +---------+
| Data | -----> < Program > -----> | Results |
+------+         \_______/         +---------+
@end example
@end ifnottex

@cindex compiled programs
@cindex interpreted programs
The ``program'' in the figure can be either a compiled
program@footnote{Compiled programs are typically written
in lower-level languages such as C, C++, Fortran, or Ada,
and then translated, or @dfn{compiled}, into a form that
the computer can execute directly.}
(such as @command{ls}),
or it may be @dfn{interpreted}.  In the latter case, a machine-executable
program such as @command{awk} reads your program, and then uses the
instructions in your program to process the data.

@cindex programming, basic steps
When you write a program, it usually consists
of the following, very basic set of steps:

@c NEXT ED: Use real images here
@iftex
@tex
\expandafter\ifx\csname graph\endcsname\relax \csname newbox\endcsname\graph\fi
\expandafter\ifx\csname graphtemp\endcsname\relax \csname newdimen\endcsname\graphtemp\fi
\setbox\graph=\vtop{\vskip 0pt\hbox{%
    \graphtemp=.5ex\advance\graphtemp by 0.600in
    \rlap{\kern 2.800in\lower\graphtemp\hbox to 0pt{\hss Yes\hss}}%
    \graphtemp=.5ex\advance\graphtemp by 0.100in
    \rlap{\kern 3.300in\lower\graphtemp\hbox to 0pt{\hss No\hss}}%
    \special{pn 8}%
    \special{pa 2100 1000}%
    \special{pa 1600 1000}%
    \special{pa 1600 1000}%
    \special{pa 1600 300}%
    \special{fp}%
    \special{sh 1.000}%
    \special{pn 8}%
    \special{pa 1575 400}%
    \special{pa 1600 300}%
    \special{pa 1625 400}%
    \special{pa 1575 400}%
    \special{fp}%
    \special{pn 8}%
    \special{pa 2600 500}%
    \special{pa 2600 900}%
    \special{fp}%
    \special{sh 1.000}%
    \special{pn 8}%
    \special{pa 2625 800}%
    \special{pa 2600 900}%
    \special{pa 2575 800}%
    \special{pa 2625 800}%
    \special{fp}%
    \special{pn 8}%
    \special{pa 3200 200}%
    \special{pa 4000 200}%
    \special{fp}%
    \special{sh 1.000}%
    \special{pn 8}%
    \special{pa 3900 175}%
    \special{pa 4000 200}%
    \special{pa 3900 225}%
    \special{pa 3900 175}%
    \special{fp}%
    \special{pn 8}%
    \special{pa 1400 200}%
    \special{pa 2100 200}%
    \special{fp}%
    \special{sh 1.000}%
    \special{pn 8}%
    \special{pa 2000 175}%
    \special{pa 2100 200}%
    \special{pa 2000 225}%
    \special{pa 2000 175}%
    \special{fp}%
    \special{pn 8}%
    \special{ar 2600 1000 400 100 0 6.28319}%
    \graphtemp=.5ex\advance\graphtemp by 1.000in
    \rlap{\kern 2.600in\lower\graphtemp\hbox to 0pt{\hss Process\hss}}%
    \special{pn 8}%
    \special{pa 2200 400}%
    \special{pa 3100 400}%
    \special{pa 3100 0}%
    \special{pa 2200 0}%
    \special{pa 2200 400}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 0.200in
    \rlap{\kern 2.688in\lower\graphtemp\hbox to 0pt{\hss More Data?\hss}}%
    \special{pn 8}%
    \special{ar 650 200 650 200 0 6.28319}%
    \graphtemp=.5ex\advance\graphtemp by 0.200in
    \rlap{\kern 0.613in\lower\graphtemp\hbox to 0pt{\hss Initialization\hss}}%
    \special{pn 8}%
    \special{ar 0 200 0 0 0 6.28319}%
    \special{pn 8}%
    \special{ar 4550 200 450 100 0 6.28319}%
    \graphtemp=.5ex\advance\graphtemp by 0.200in
    \rlap{\kern 4.600in\lower\graphtemp\hbox to 0pt{\hss Clean Up\hss}}%
    \hbox{\vrule depth1.100in width0pt height 0pt}%
    \kern 5.000in
  }%
}%
\centerline{\box\graph}
@end tex
@end iftex
@ifnottex
@example
                              ______
+----------------+           / More \  No       +----------+
| Initialization | -------> <  Data  > -------> | Clean Up |
+----------------+    ^      \   ?  /           +----------+
                      |       +--+-+
                      |          | Yes
                      |          |
                      |          V
                      |     +---------+
                      +-----+ Process |
                            +---------+
@end example
@end ifnottex

@table @asis
@item Initialization
These are the things you do before actually starting to process
data, such as checking arguments, initializing any data you need
to work with, and so on.
This step corresponds to @command{awk}'s @code{BEGIN} rule
(@pxref{BEGIN/END}).

If you were baking a cake, this might consist of laying out all the
mixing bowls and the baking pan, and making sure you have all the
ingredients that you need.

@item Processing
This is where the actual work is done.  Your program reads data,
one logical chunk at a time, and processes it as appropriate.

In most programming languages, you have to manually manage the reading
of data, checking to see if there is more each time you read a chunk.
@command{awk}'s pattern-action paradigm
(@pxref{Getting Started})
handles the mechanics of this for you.

In baking a cake, the processing corresponds to the actual labor:
breaking eggs, mixing the flour, water, and other ingredients, and then putting the cake
into the oven.

@item Clean Up
Once you've processed all the data, you may have things you need to
do before exiting.
This step corresponds to @command{awk}'s @code{END} rule
(@pxref{BEGIN/END}).

After the cake comes out of the oven, you still have to wrap it in
plastic wrap to keep anyone from tasting it, as well as wash
the mixing bowls and utensils.
@end table

@cindex algorithms
An @dfn{algorithm} is a detailed set of instructions necessary to accomplish
a task, or process data.  It is much the same as a recipe for baking
a cake.  Programs implement algorithms.
Often, it is up to you to design
the algorithm and implement it, simultaneously.

@cindex records
@cindex fields
The ``logical chunks'' we talked about previously are called @dfn{records},
similar to the records a company keeps on employees, a school keeps for
students, or a doctor keeps for patients.
Each record has many component parts, such as first and last names,
date of birth, address, and so on.  The component parts are referred
to as the @dfn{fields} of the record.

The act of reading data is termed @dfn{input}, and that of
generating results, not too surprisingly, is termed @dfn{output}.
They are often referred to together as ``input/output,''
and even more often, as ``I/O'' for short.
(You will also see ``input'' and ``output'' used as verbs.)

@cindex data-driven languages
@c comma is part of primary
@cindex languages, data-driven
@command{awk} manages the reading of data for you, as well as
breaking it up into records and fields.  Your program's job is to
tell @command{awk} what to do with the data.  You do this by describing
@dfn{patterns} in the data to look for, and @dfn{actions} to execute
when those patterns are seen.  This @dfn{data-driven} nature of
@command{awk} programs usually makes them both easier to write
and easier to read.

@node Basic Data Typing
@appendixsec Data Values in a Computer

@cindex variables
In a program,
you keep track of information and values in things called @dfn{variables}.
A variable is just a name for a given value, such as @code{first_name},
@code{last_name}, @code{address}, and so on.
@command{awk} has several predefined variables, and it has
special names to refer to the current input record
and the fields of the record.
You may also group multiple
associated values under one name, as an array.
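
As a small illustration of these ideas, here is a sketch in
@command{awk}.  It is not taken from any particular program: the names
@code{first_name} and @code{count} are invented for the example, and it
assumes that each input record has a first name in its first field:

@example
# Runs once for each input record.
@{
    first_name = $1        # a variable: a name for one value
    count[first_name]++    # an array: many values under one name
@}

# Runs after all input has been read.
END @{
    for (name in count)
        print name, count[name]
@}
@end example

@noindent
Here, @code{$1} is one of the special names that refer to the fields of
the current input record, and @code{$0} (not used above) refers
to the whole record.  The order in which the @code{for} loop visits
the elements of @code{count} is not specified.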

@cindex values, numeric
@cindex values, string
@cindex scalar values
Data, particularly in @command{awk}, consists of either numeric
values, such as 42 or 3.1415927, or string values.
String values are essentially anything that's not a number, such as a name.
Strings are sometimes referred to as @dfn{character data}, since they
store the individual characters that comprise them.
Individual variables, as well as numeric and string variables, are
referred to as @dfn{scalar} values.
Groups of values, such as arrays, are not scalars.

@cindex integers
@cindex floating-point, numbers
@cindex numbers, floating-point
Within computers, there are two kinds of numeric values: @dfn{integers}
and @dfn{floating-point}.
In school, integer values were referred to as ``whole'' numbers---that is,
numbers without any fractional part, such as 1, 42, or @minus{}17.
The advantage to integer numbers is that they represent values exactly.
The disadvantage is that their range is limited.  On systems that use
32-bit integers, this range is @minus{}2,147,483,648 to 2,147,483,647.

@cindex unsigned integers
@cindex integers, unsigned
Integer values come in two flavors: @dfn{signed} and @dfn{unsigned}.
Signed values may be negative or positive, with the range of values just
described.
Unsigned values are never negative.  With 32-bit integers,
the range is from 0 to 4,294,967,295.

@cindex double-precision floating-point
@cindex single-precision floating-point
Floating-point numbers represent what are called ``real'' numbers; i.e.,
those that do have a fractional part, such as 3.1415927.
The advantage to floating-point numbers is that they
can represent a much larger range of values.
The disadvantage is that there are numbers that they cannot represent
exactly.
@command{awk} uses @dfn{double-precision} floating-point numbers, which
can hold more digits than @dfn{single-precision}
floating-point numbers.
Floating-point issues are discussed more fully in
@ref{Floating Point Issues}.

At the very lowest level, computers store values as groups of binary digits,
or @dfn{bits}.  Modern computers group bits into groups of eight, called @dfn{bytes}.
Advanced applications sometimes have to manipulate bits directly,
and @command{gawk} provides functions for doing so.

@cindex null strings
While you are probably used to the idea of a number without a value (i.e., zero),
it takes a bit more getting used to the idea of zero-length character data.
Nevertheless, such a thing exists.
It is called the @dfn{null string}.
The null string is character data that has no value.
In other words, it is empty.  It is written in @command{awk} programs
like this: @code{""}.

Humans are used to working in decimal; i.e., base 10.  In base 10,
numbers go from 0 to 9, and then ``roll over'' into the next
column.  (Remember grade school? 42 is 4 times 10 plus 2.)

There are other number bases though.  Computers commonly use base 2
or @dfn{binary}, base 8 or @dfn{octal}, and base 16 or @dfn{hexadecimal}.
In binary, each column represents two times the value in the column to
its right. Each column may contain either a 0 or a 1.
Thus, binary 1010 represents 1 times 8, plus 0 times 4, plus 1 times 2,
plus 0 times 1, or decimal 10.
Octal and hexadecimal are discussed more in
@ref{Nondecimal-numbers}.

Programs are written in programming languages.
Hundreds, if not thousands, of programming languages exist.
One of the most popular is the C programming language.
The C language had a very strong influence on the design of
the @command{awk} language.

@cindex Kernighan, Brian
@cindex Ritchie, Dennis
There have been several versions of C.  The first is often referred to
as ``K&R'' C, after the initials of Brian Kernighan and Dennis Ritchie,
the authors of the first book on C.  (Dennis Ritchie created the language,
and Brian Kernighan was one of the creators of @command{awk}.)

In the mid-1980s, an effort began to produce an international standard
for C.  This work culminated in 1989, with the production of the ANSI
standard for C.  This standard became an ISO standard in 1990.
Where it makes sense, POSIX @command{awk} is compatible with 1990 ISO C.

In 1999, a revised ISO C standard was approved and released.
Future versions of @command{gawk} will be as compatible as possible
with this standard.

@node Floating Point Issues
@appendixsec Floating-Point Number Caveats

As mentioned earlier, floating-point numbers represent what are called
``real'' numbers, i.e., those that have a fractional part.  @command{awk}
uses double-precision floating-point numbers to represent all
numeric values.  This @value{SECTION} describes some of the issues
involved in using floating-point numbers.

There is a very nice paper on floating-point arithmetic by
David Goldberg, ``What Every
Computer Scientist Should Know About Floating-point Arithmetic,''
@cite{ACM Computing Surveys} @strong{23}, 1 (1991-03),
5-48.@footnote{@uref{http://www.validlab.com/goldberg/paper.ps}.}
This is worth reading if you are interested in the details,
but it does require a background in computer science.

Internally, @command{awk} keeps both the numeric value
(double-precision floating-point) and the string value for a variable.
Separately, @command{awk} keeps
track of what type the variable has
(@pxref{Typing and Comparison}),
which plays a role in how variables are used in comparisons.

It is important to note that the string value for a number may not
reflect the full value (all the digits) that the numeric value
actually contains.
The following program (@file{values.awk}) illustrates this:

@example
@{
   $1 = $2 + $3
   # see it for what it is
   printf("$1 = %.12g\n", $1)
   # use CONVFMT
   a = "<" $1 ">"
   print "a =", a
@group
   # use OFMT
   print "$1 =", $1
@end group
@}
@end example

@noindent
This program shows the full value of the sum of @code{$2} and @code{$3}
using @code{printf}, and then prints the string values obtained
from both automatic conversion (via @code{CONVFMT}) and
from printing (via @code{OFMT}).

Here is what happens when the program is run:

@example
$ echo 2 3.654321 1.2345678 | awk -f values.awk
@print{} $1 = 4.8888888
@print{} a = <4.88889>
@print{} $1 = 4.88889
@end example

This makes it clear that the full numeric value is different from
what the default string representations show.

@code{CONVFMT}'s default value is @code{"%.6g"}, which yields a value with
at most six significant digits.  For some applications, you might want to
change it to specify more precision.
On most modern machines, most of the time,
17 digits is enough to capture a floating-point number's
value exactly.@footnote{Pathological cases can require up to
752 digits (!), but we doubt that you need to worry about this.}

@cindex floating-point
Unlike numbers in the abstract sense (such as what you studied in high school
or college math), numbers stored in computers are limited in certain ways.
They cannot represent an infinite number of digits, nor can they always
represent things exactly.
In particular,
floating-point numbers cannot
always represent values exactly.  Here is an example:

@example
$ awk '@{ printf("%010d\n", $1 * 100) @}'
515.79
@print{} 0000051579
515.80
@print{} 0000051579
515.81
@print{} 0000051580
515.82
@print{} 0000051582
@kbd{@value{CTL}-d}
@end example

@noindent
This shows that some values can be represented exactly,
whereas others are only approximated.  This is not a ``bug''
in @command{awk}, but simply an artifact of how computers
represent numbers.

@cindex negative zero
@cindex positive zero
@c comma is part of primary
@cindex zero, negative vs.@: positive
Another peculiarity of floating-point numbers on modern systems
is that they often have more than one representation for the number zero!
In particular, it is possible to represent ``minus zero'' as well as
regular, or ``positive'' zero.

This example shows that negative and positive zero are distinct values
when stored internally, but that they are in fact equal to each other,
as well as to ``regular'' zero:

@smallexample
$ gawk 'BEGIN @{ mz = -0 ; pz = 0
> printf "-0 = %g, +0 = %g, (-0 == +0) -> %d\n", mz, pz, mz == pz
> printf "mz == 0 -> %d, pz == 0 -> %d\n", mz == 0, pz == 0
> @}'
@print{} -0 = -0, +0 = 0, (-0 == +0) -> 1
@print{} mz == 0 -> 1, pz == 0 -> 1
@end smallexample

It helps to keep this in mind should you process numeric data
that contains negative zero values; the fact that the zero is negative
is noted and can affect comparisons.
@c ENDOFRANGE procon

@node Glossary
@unnumbered Glossary

@table @asis
@item Action
A series of @command{awk} statements attached to a rule.
If the rule's
pattern matches an input record, @command{awk} executes the
rule's action.  Actions are always enclosed in curly braces.
(@xref{Action Overview}.)

@cindex Spencer, Henry
@cindex @command{sed} utility
@cindex amazing @command{awk} assembler (@command{aaa})
@item Amazing @command{awk} Assembler
Henry Spencer at the University of Toronto wrote a retargetable assembler
completely as @command{sed} and @command{awk} scripts.  It is thousands
of lines long, including machine descriptions for several eight-bit
microcomputers.  It is a good example of a program that would have been
better written in another language.
You can get it from @uref{ftp://ftp.freefriends.org/arnold/Awkstuff/aaa.tgz}.

@cindex amazingly workable formatter (@command{awf})
@cindex @command{awf} (amazingly workable formatter) program
@item Amazingly Workable Formatter (@command{awf})
Henry Spencer at the University of Toronto wrote a formatter that accepts
a large subset of the @samp{nroff -ms} and @samp{nroff -man} formatting
commands, using @command{awk} and @command{sh}.
It is available over the Internet
from @uref{ftp://ftp.freefriends.org/arnold/Awkstuff/awf.tgz}.

@item Anchor
The regexp metacharacters @samp{^} and @samp{$}, which force the match
to the beginning or end of the string, respectively.

@cindex ANSI
@item ANSI
The American National Standards Institute.  This organization produces
many standards, among them the standards for the C and C++ programming
languages.
These standards often become international standards as well. See also
``ISO.''

@item Array
A grouping of multiple values under the same name.
Most languages just provide sequential arrays.
@command{awk} provides associative arrays.

@item Assertion
A statement in a program that a condition is true at this point in the program.
Useful for reasoning about how a program is supposed to behave.

@item Assignment
An @command{awk} expression that changes the value of some @command{awk}
variable or data object.  An object that you can assign to is called an
@dfn{lvalue}.  The assigned values are called @dfn{rvalues}.
@xref{Assignment Ops}.

@item Associative Array
Arrays in which the indices may be numbers or strings, not just
sequential integers in a fixed range.

@item @command{awk} Language
The language in which @command{awk} programs are written.

@item @command{awk} Program
An @command{awk} program consists of a series of @dfn{patterns} and
@dfn{actions}, collectively known as @dfn{rules}.  For each input record
given to the program, the program's rules are all processed in turn.
@command{awk} programs may also contain function definitions.

@item @command{awk} Script
Another name for an @command{awk} program.

@item Bash
The GNU version of the standard shell
@ifnotinfo
(the @b{B}ourne-@b{A}gain @b{SH}ell).
@end ifnotinfo
@ifinfo
(the Bourne-Again SHell).
@end ifinfo
See also ``Bourne Shell.''

@item BBS
See ``Bulletin Board System.''

@item Bit
Short for ``Binary Digit.''
All values in computer memory ultimately reduce to binary digits: values
that are either zero or one.
Groups of bits may be interpreted differently---as integers,
floating-point numbers, character data, addresses of other
memory objects, or other data.
@command{awk} lets you work with floating-point numbers and strings.
@command{gawk} lets you manipulate bit values with the built-in
functions described in
@ref{Bitwise Functions}.

Computers are often defined by how many bits they use to represent integer
values.  Typical systems are 32-bit systems, but 64-bit systems are
becoming increasingly popular, and 16-bit systems are waning in
popularity.

@item Boolean Expression
Named after the English mathematician Boole. See also ``Logical Expression.''

@item Bourne Shell
The standard shell (@file{/bin/sh}) on Unix and Unix-like systems,
originally written by Stephen R.@: Bourne.
Many shells (@command{bash}, @command{ksh}, @command{pdksh}, @command{zsh}) are
generally upwardly compatible with the Bourne shell.

@item Built-in Function
The @command{awk} language provides built-in functions that perform various
numerical, I/O-related, and string computations.  Examples are
@code{sqrt} (for the square root of a number) and @code{substr} (for a
substring of a string).
@command{gawk} provides functions for timestamp management, bit manipulation,
and runtime string translation.
(@xref{Built-in}.)

@item Built-in Variable
@code{ARGC},
@code{ARGV},
@code{CONVFMT},
@code{ENVIRON},
@code{FILENAME},
@code{FNR},
@code{FS},
@code{NF},
@code{NR},
@code{OFMT},
@code{OFS},
@code{ORS},
@code{RLENGTH},
@code{RSTART},
@code{RS},
and
@code{SUBSEP}
are the variables that have special meaning to @command{awk}.
In addition,
@code{ARGIND},
@code{BINMODE},
@code{ERRNO},
@code{FIELDWIDTHS},
@code{IGNORECASE},
@code{LINT},
@code{PROCINFO},
@code{RT},
and
@code{TEXTDOMAIN}
are the variables that have special meaning to @command{gawk}.
Changing some of them affects @command{awk}'s running environment.
(@xref{Built-in Variables}.)

@item Braces
See ``Curly Braces.''

@item Bulletin Board System
A computer system allowing users to log in and read and/or leave messages
for other users of the system, much like leaving paper notes on a bulletin
board.

@item C
The system programming language that most GNU software is written in.  The
@command{awk} programming language has C-like syntax, and this @value{DOCUMENT}
points out similarities between @command{awk} and C when appropriate.

In general, @command{gawk} attempts to be as similar to the 1990 version
of ISO C as makes sense.  Future versions of @command{gawk} may adopt features
from the newer 1999 standard, as appropriate.

@item C++
A popular object-oriented programming language derived from C.

@cindex ISO 8859-1
@cindex ISO Latin-1
@cindex character sets (machine character encodings)
@item Character Set
The set of numeric codes used by a computer system to represent the
characters (letters, numbers, punctuation, etc.) of a particular country
or place. The most common character set in use today is ASCII (American
Standard Code for Information Interchange).  Many European
countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1).

@cindex @command{chem} utility
@item CHEM
A preprocessor for @command{pic} that reads descriptions of molecules
and produces @command{pic} input for drawing them.
It was written in @command{awk}
by Brian Kernighan and Jon Bentley, and is available from
@uref{http://cm.bell-labs.com/netlib/typesetting/chem.gz}.

@item Coprocess
A subordinate program with which two-way communication is possible.

@cindex compiled programs
@item Compiler
A program that translates human-readable source code into
machine-executable object code.
The object code is then executed
directly by the computer.
See also ``Interpreter.''

@item Compound Statement
A series of @command{awk} statements, enclosed in curly braces.  Compound
statements may be nested.
(@xref{Statements}.)

@item Concatenation
Concatenating two strings means sticking them together, one after another,
producing a new string.  For example, the string @samp{foo} concatenated with
the string @samp{bar} gives the string @samp{foobar}.
(@xref{Concatenation}.)

@item Conditional Expression
An expression using the @samp{?:} ternary operator, such as
@samp{@var{expr1} ? @var{expr2} : @var{expr3}}.  The expression
@var{expr1} is evaluated; if the result is true, the value of the whole
expression is the value of @var{expr2}; otherwise the value is
@var{expr3}.  In either case, only one of @var{expr2} and @var{expr3}
is evaluated. (@xref{Conditional Exp}.)

@item Comparison Expression
A relation that is either true or false, such as @samp{(a < b)}.
Comparison expressions are used in @code{if}, @code{while}, @code{do},
and @code{for}
statements, and in patterns to select which input records to process.
(@xref{Typing and Comparison}.)

@item Curly Braces
The characters @samp{@{} and @samp{@}}.  Curly braces are used in
@command{awk} for delimiting actions, compound statements, and function
bodies.

@cindex dark corner
@item Dark Corner
An area in the language where specifications often were (or still
are) not clear, leading to unexpected or undesirable behavior.
Such areas are marked in this @value{DOCUMENT} with
@iftex
the picture of a flashlight in the margin
@end iftex
@ifnottex
``(d.c.)'' in the text
@end ifnottex
and are indexed under the heading ``dark corner.''

@item Data Driven
A description of @command{awk} programs, where you specify the data you
are interested in processing, and what to do when that data is seen.

@item Data Objects
These are numbers and strings of characters.  Numbers are converted into
strings and vice versa, as needed.
(@xref{Conversion}.)

@item Deadlock
The situation in which two communicating processes are each waiting
for the other to perform an action.

@item Double-Precision
An internal representation of numbers that can have fractional parts.
Double-precision numbers keep track of more digits than do single-precision
numbers, but operations on them are sometimes more expensive.  This is the way
@command{awk} stores numeric values.  It is the C type @code{double}.

@item Dynamic Regular Expression
A dynamic regular expression is a regular expression written as an
ordinary expression.  It could be a string constant, such as
@code{"foo"}, but it may also be an expression whose value can vary.
(@xref{Computed Regexps}.)

@item Environment
A collection of strings, of the form @var{name@code{=}val}, that each
program has available to it. Users generally place values into the
environment in order to provide information to various programs. Typical
examples are the environment variables @env{HOME} and @env{PATH}.

@item Empty String
See ``Null String.''

@cindex epoch, definition of
@item Epoch
The date used as the ``beginning of time'' for timestamps.
Time values in Unix systems are represented as seconds since the epoch,
with library functions available for converting these values into
standard date and time formats.

The epoch on Unix and POSIX systems is 1970-01-01 00:00:00 UTC.
See also ``GMT'' and ``UTC.''

@item Escape Sequences
A special sequence of characters used for describing nonprinting
characters, such as @samp{\n} for newline or @samp{\033} for the ASCII
ESC (Escape) character. (@xref{Escape Sequences}.)

@item FDL
See ``Free Documentation License.''

@item Field
When @command{awk} reads an input record, it splits the record into pieces
separated by whitespace (or by a separator regexp that you can
change by setting the built-in variable @code{FS}). Such pieces are
called fields. If the pieces are of fixed length, you can use the built-in
variable @code{FIELDWIDTHS} to describe their lengths.
(@xref{Field Separators},
and
@ref{Constant Size}.)

@item Flag
A variable whose truth value indicates the existence or nonexistence
of some condition.

@item Floating-Point Number
Often referred to in mathematical terms as a ``rational'' or real number,
this is just a number that can have a fractional part.
See also ``Double-Precision'' and ``Single-Precision.''

@item Format
Format strings are used to control the appearance of output in the
@code{strftime} and @code{sprintf} functions, and are used in the
@code{printf} statement as well. Also, data conversions from numbers to strings
are controlled by the format string contained in the built-in variable
@code{CONVFMT}. (@xref{Control Letters}.)

@item Free Documentation License
This document describes the terms under which this @value{DOCUMENT}
is published and may be copied. (@xref{GNU Free Documentation License}.)

@item Function
A specialized group of statements used to encapsulate general
or program-specific tasks. @command{awk} has a number of built-in
functions, and also allows you to define your own.
(@xref{Functions}.)

@item FSF
See ``Free Software Foundation.''

@cindex FSF (Free Software Foundation)
@cindex Free Software Foundation (FSF)
@cindex Stallman, Richard
@item Free Software Foundation
A nonprofit organization dedicated
to the production and distribution of freely distributable software.
It was founded by Richard M.@: Stallman, the author of the original
Emacs editor. GNU Emacs is the most widely used version of Emacs today.

@item @command{gawk}
The GNU implementation of @command{awk}.

@cindex GPL (General Public License)
@cindex General Public License (GPL)
@cindex GNU General Public License
@item General Public License
This document describes the terms under which @command{gawk} and its source
code may be distributed. (@xref{Copying}.)

@item GMT
``Greenwich Mean Time.''
This is the old term for UTC.
It is the time of day used as the epoch for Unix and POSIX systems.
See also ``Epoch'' and ``UTC.''

@cindex FSF (Free Software Foundation)
@cindex Free Software Foundation (FSF)
@cindex GNU Project
@item GNU
``GNU's not Unix''. An on-going project of the Free Software Foundation
to create a complete, freely distributable, POSIX-compliant computing
environment.

@item GNU/Linux
A variant of the GNU system using the Linux kernel, instead of the
Free Software Foundation's Hurd kernel.
Linux is a stable, efficient, full-featured clone of Unix that has
been ported to a variety of architectures.
It is most popular on PC-class systems, but runs well on a variety of
other systems too.
The Linux kernel source code is available under the terms of the GNU General
Public License, which is perhaps its most important aspect.

@item GPL
See ``General Public License.''

@item Hexadecimal
Base 16 notation, where the digits are @code{0}--@code{9} and
@code{A}--@code{F}, with @samp{A}
representing 10, @samp{B} representing 11, and so on, up to @samp{F} for 15.
Hexadecimal numbers are written in C using a leading @samp{0x},
to indicate their base. Thus, @code{0x12} is 18 (1 times 16 plus 2).

@item I/O
Abbreviation for ``Input/Output,'' the act of moving data into and/or
out of a running program.

@item Input Record
A single chunk of data that is read in by @command{awk}. Usually, an @command{awk} input
record consists of one line of text.
(@xref{Records}.)

@item Integer
A whole number, i.e., a number that does not have a fractional part.

@item Internationalization
The process of writing or modifying a program so
that it can use multiple languages without requiring
further source code changes.

@cindex interpreted programs
@item Interpreter
A program that reads human-readable source code directly, and uses
the instructions in it to process data and produce results.
@command{awk} is typically (but not always) implemented as an interpreter.
See also ``Compiler.''

@item Interval Expression
A component of a regular expression that lets you specify repeated matches of
some part of the regexp. Interval expressions were not traditionally available
in @command{awk} programs.

@cindex ISO
@item ISO
The International Organization for Standardization.
This organization produces international standards for many things, including
programming languages, such as C and C++.
In the computer arena, important standards like those for C, C++, and POSIX
become both American national and ISO international standards simultaneously.
This @value{DOCUMENT} refers to Standard C as ``ISO C'' throughout.

@item Keyword
In the @command{awk} language, a keyword is a word that has special
meaning. Keywords are reserved and may not be used as variable names.

@command{gawk}'s keywords are:
@code{BEGIN},
@code{END},
@code{if},
@code{else},
@code{while},
@code{do@dots{}while},
@code{for},
@code{for@dots{}in},
@code{break},
@code{continue},
@code{delete},
@code{next},
@code{nextfile},
@code{function},
@code{func},
and
@code{exit}.

@cindex LGPL (Lesser General Public License)
@cindex Lesser General Public License (LGPL)
@cindex GNU Lesser General Public License
@item Lesser General Public License
This document describes the terms under which binary library archives
or shared objects,
and their source code may be distributed.

@item Linux
See ``GNU/Linux.''

@item LGPL
See ``Lesser General Public License.''

@item Localization
The process of providing the data necessary for an
internationalized program to work in a particular language.

@item Logical Expression
An expression using the operators for logic, AND, OR, and NOT, written
@samp{&&}, @samp{||}, and @samp{!} in @command{awk}. Often called Boolean
expressions, after the mathematician who pioneered this kind of
mathematical logic.

@item Lvalue
An expression that can appear on the left side of an assignment
operator. In most languages, lvalues can be variables or array
elements. In @command{awk}, a field designator can also be used as an
lvalue.
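
As a small illustration of a field designator used as an lvalue, here is
a minimal sketch. It is not taken from the original text: the sample
input and the shell function name @code{replace_second} are invented for
this example. The @command{awk} program assigns to @code{$2}, and
@command{awk} then rebuilds the record from its fields:

```shell
# Hypothetical helper (name invented for this sketch): replace the
# second field of every input line with the string "X".
# Assigning to $2 uses a field designator as an lvalue; awk then
# rebuilds $0, joining the fields with OFS (a single space by default).
replace_second() {
    awk '{ $2 = "X"; print }'
}

# Sample input invented for the illustration.
printf 'a b c\n' | replace_second
```

The assignment changes only @command{awk}'s in-memory copy of the
record; the input itself is left untouched.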

@item Matching
The act of testing a string against a regular expression. If the
regexp describes the contents of the string, it is said to @dfn{match} it.

@item Metacharacters
Characters used within a regexp that do not stand for themselves.
Instead, they denote regular expression operations, such as repetition,
grouping, or alternation.

@item Null String
A string with no characters in it. It is represented explicitly in
@command{awk} programs by placing two double quote characters next to
each other (@code{""}). It can appear in input data by having two successive
occurrences of the field separator appear next to each other.

@item Number
A numeric-valued data object. Modern @command{awk} implementations use
double-precision floating-point to represent numbers.
Very old @command{awk} implementations use single-precision floating-point.

@item Octal
Base-eight notation, where the digits are @code{0}--@code{7}.
Octal numbers are written in C using a leading @samp{0},
to indicate their base. Thus, @code{013} is 11 (one times 8 plus 3).

@cindex P1003.2 POSIX standard
@item P1003.2
See ``POSIX.''

@item Pattern
Patterns tell @command{awk} which input records are interesting to which
rules.

A pattern is an arbitrary conditional expression against which input is
tested. If the condition is satisfied, the pattern is said to @dfn{match}
the input record. A typical pattern might compare the input record against
a regular expression. (@xref{Pattern Overview}.)

@item POSIX
The name for a series of standards
@c being developed by the IEEE
that specify a Portable Operating System interface. The ``IX'' denotes
the Unix heritage of these standards.
The main standard of interest for
@command{awk} users is
@cite{IEEE Standard for Information Technology, Standard 1003.2-1992,
Portable Operating System Interface (POSIX) Part 2: Shell and Utilities}.
Informally, this standard is often referred to as simply ``P1003.2.''

@item Precedence
The order in which operations are performed when operators are used
without explicit parentheses.

@item Private
Variables and/or functions that are meant for use exclusively by library
functions and not for the main @command{awk} program. Special care must be
taken when naming such variables and functions.
(@xref{Library Names}.)

@item Range (of input lines)
A sequence of consecutive lines from the input file(s). A pattern
can specify ranges of input lines for @command{awk} to process or it can
specify single lines. (@xref{Pattern Overview}.)

@item Recursion
When a function calls itself, either directly or indirectly.
If this isn't clear, refer to the entry for ``recursion.''

@item Redirection
Redirection means performing input from something other than the standard input
stream, or performing output to something other than the standard output stream.

You can redirect the output of the @code{print} and @code{printf} statements
to a file or a system command, using the @samp{>}, @samp{>>}, @samp{|}, and @samp{|&}
operators. You can redirect input to the @code{getline} statement using
the @samp{<}, @samp{|}, and @samp{|&} operators.
(@xref{Redirection},
and @ref{Getline}.)

@item Regexp
Short for @dfn{regular expression}. A regexp is a pattern that denotes a
set of strings, possibly an infinite set. For example, the regexp
@samp{R.*xp} matches any string starting with the letter @samp{R}
and ending with the letters @samp{xp}.
In @command{awk}, regexps are
used in patterns and in conditional expressions. Regexps may contain
escape sequences. (@xref{Regexp}.)

@item Regular Expression
See ``regexp.''

@item Regular Expression Constant
A regular expression constant is a regular expression written within
slashes, such as @code{/foo/}. This regular expression is chosen
when you write the @command{awk} program and cannot be changed during
its execution. (@xref{Regexp Usage}.)

@item Rule
A segment of an @command{awk} program that specifies how to process single
input records. A rule consists of a @dfn{pattern} and an @dfn{action}.
@command{awk} reads an input record; then, for each rule, if the input record
satisfies the rule's pattern, @command{awk} executes the rule's action.
Otherwise, the rule does nothing for that input record.

@item Rvalue
A value that can appear on the right side of an assignment operator.
In @command{awk}, essentially every expression has a value. These values
are rvalues.

@item Scalar
A single value, be it a number or a string.
Regular variables are scalars; arrays and functions are not.

@item Search Path
In @command{gawk}, a list of directories to search for @command{awk} program source files.
In the shell, a list of directories to search for executable programs.

@item Seed
The initial value, or starting point, for a sequence of random numbers.

@item @command{sed}
See ``Stream Editor.''

@item Shell
The command interpreter for Unix and POSIX-compliant systems.
The shell works both interactively, and as a programming language
for batch files, or shell scripts.

@item Short-Circuit
The nature of the @command{awk} logical operators @samp{&&} and @samp{||}.
If the value of the entire expression is determinable from evaluating just
the lefthand side of these operators, the righthand side is not
evaluated.
(@xref{Boolean Ops}.)

@item Side Effect
A side effect occurs when an expression has an effect aside from merely
producing a value. Assignment expressions, increment and decrement
expressions, and function calls have side effects.
(@xref{Assignment Ops}.)

@item Single-Precision
An internal representation of numbers that can have fractional parts.
Single-precision numbers keep track of fewer digits than do double-precision
numbers, but operations on them are sometimes less expensive in terms of CPU time.
This is the type used by some very old versions of @command{awk} to store
numeric values. It is the C type @code{float}.

@item Space
The character generated by hitting the space bar on the keyboard.

@item Special File
A @value{FN} interpreted internally by @command{gawk}, instead of being handed
directly to the underlying operating system---for example, @file{/dev/stderr}.
(@xref{Special Files}.)

@item Stream Editor
A program that reads records from an input stream and processes them one
or more at a time. This is in contrast with batch programs, which may
expect to read their input files in entirety before starting to do
anything, as well as with interactive programs which require input from the
user.

@item String
A datum consisting of a sequence of characters, such as @samp{I am a
string}. Constant strings are written with double quotes in the
@command{awk} language and may contain escape sequences.
(@xref{Escape Sequences}.)

@item Tab
The character generated by hitting the @kbd{TAB} key on the keyboard.
It usually expands to up to eight spaces upon output.

@item Text Domain
A unique name that identifies an application.
Used for grouping messages that are translated at runtime
into the local language.

@item Timestamp
A value in the ``seconds since the epoch'' format used by Unix
and POSIX systems. Used for the @command{gawk} functions
@code{mktime}, @code{strftime}, and @code{systime}.
See also ``Epoch'' and ``UTC.''

@cindex Linux
@cindex GNU/Linux
@cindex Unix
@cindex BSD-based operating systems
@cindex NetBSD
@cindex FreeBSD
@cindex OpenBSD
@item Unix
A computer operating system originally developed in the early 1970's at
AT&T Bell Laboratories. It initially became popular in universities around
the world and later moved into commercial environments as a software
development system and network server system. There are many commercial
versions of Unix, as well as several work-alike systems whose source code
is freely available (such as GNU/Linux, NetBSD, FreeBSD, and OpenBSD).

@item UTC
The accepted abbreviation for ``Coordinated Universal Time.''
This is standard time in Greenwich, England, which is used as a
reference time for day and date calculations.
See also ``Epoch'' and ``GMT.''

@item Whitespace
A sequence of space, TAB, or newline characters occurring inside an input
record or a string.
@end table

@node Copying
@unnumbered GNU General Public License
@center Version 2, June 1991

@display
Copyright @copyright{} 1989, 1991 Free Software Foundation, Inc.
59 Temple Place, Suite 330, Boston, MA 02111, USA

Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
@end display

@c fakenode --- for prepinfo
@unnumberedsec Preamble

  The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software---to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Library General Public License instead.) You can apply it to
your programs, too.

  When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.

  To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.

  For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.

  We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.

  Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.

  Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.

  The precise terms and conditions for copying, distribution and
modification follow.

@ifnotinfo
@c fakenode --- for prepinfo
@unnumberedsec Terms and Conditions for Copying, Distribution and Modification
@end ifnotinfo
@ifinfo
@center TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
@end ifinfo

@enumerate 0
@item
This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The ``Program'', below,
refers to any such program or work, and a ``work based on the Program''
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term ``modification''.) Each licensee is addressed as ``you''.

Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope.
The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.

@item
You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.

You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.

@item
You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:

@enumerate a
@item
You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.

@item
You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.

@item
If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
@end enumerate

These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.

In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.

@item
You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:

@enumerate a
@item
Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,

@item
Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,

@item
Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
@end enumerate

The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.

If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.

@item
You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.

@item
You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.

@item
Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.

@item
If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.

It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.

This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.

@item
If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.

@item
The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and ``any
later version'', you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.

@item
If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
27515 27516@ifnotinfo 27517@c fakenode --- for prepinfo 27518@heading NO WARRANTY 27519@end ifnotinfo 27520@ifinfo 27521@center NO WARRANTY 27522@end ifinfo 27523 27524@item 27525BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY 27526FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW@. EXCEPT WHEN 27527OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES 27528PROVIDE THE PROGRAM ``AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED 27529OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 27530MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE@. THE ENTIRE RISK AS 27531TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU@. SHOULD THE 27532PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, 27533REPAIR OR CORRECTION. 27534 27535@item 27536IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 27537WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR 27538REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, 27539INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING 27540OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED 27541TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY 27542YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER 27543PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE 27544POSSIBILITY OF SUCH DAMAGES. 
27545@end enumerate 27546 27547@ifnotinfo 27548@c fakenode --- for prepinfo 27549@heading END OF TERMS AND CONDITIONS 27550@end ifnotinfo 27551@ifinfo 27552@center END OF TERMS AND CONDITIONS 27553@end ifinfo 27554 27555@page 27556@c fakenode --- for prepinfo 27557@unnumberedsec How to Apply These Terms to Your New Programs 27558 27559 If you develop a new program, and you want it to be of the greatest 27560possible use to the public, the best way to achieve this is to make it 27561free software which everyone can redistribute and change under these terms. 27562 27563 To do so, attach the following notices to the program. It is safest 27564to attach them to the start of each source file to most effectively 27565convey the exclusion of warranty; and each file should have at least 27566the ``copyright'' line and a pointer to where the full notice is found. 27567 27568@smallexample 27569@var{one line to give the program's name and an idea of what it does.} 27570Copyright (C) @var{year} @var{name of author} 27571 27572This program is free software; you can redistribute it and/or 27573modify it under the terms of the GNU General Public License 27574as published by the Free Software Foundation; either version 2 27575of the License, or (at your option) any later version. 27576 27577This program is distributed in the hope that it will be useful, 27578but WITHOUT ANY WARRANTY; without even the implied warranty of 27579MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE@. See the 27580GNU General Public License for more details. 27581 27582You should have received a copy of the GNU General Public License 27583along with this program; if not, write to the Free Software 27584Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111, USA. 27585@end smallexample 27586 27587Also add information on how to contact you by electronic and paper mail. 
27588 27589If the program is interactive, make it output a short notice like this 27590when it starts in an interactive mode: 27591 27592@smallexample 27593Gnomovision version 69, Copyright (C) @var{year} @var{name of author} 27594Gnomovision comes with ABSOLUTELY NO WARRANTY; for details 27595type `show w'. This is free software, and you are welcome 27596to redistribute it under certain conditions; type `show c' 27597for details. 27598@end smallexample 27599 27600The hypothetical commands @samp{show w} and @samp{show c} should show 27601the appropriate parts of the General Public License. Of course, the 27602commands you use may be called something other than @samp{show w} and 27603@samp{show c}; they could even be mouse-clicks or menu items---whatever 27604suits your program. 27605 27606You should also get your employer (if you work as a programmer) or your 27607school, if any, to sign a ``copyright disclaimer'' for the program, if 27608necessary. Here is a sample; alter the names: 27609 27610@smallexample 27611@group 27612Yoyodyne, Inc., hereby disclaims all copyright 27613interest in the program `Gnomovision' 27614(which makes passes at compilers) written 27615by James Hacker. 27616 27617@var{signature of Ty Coon}, 1 April 1989 27618Ty Coon, President of Vice 27619@end group 27620@end smallexample 27621 27622This General Public License does not permit incorporating your program into 27623proprietary programs. If your program is a subroutine library, you may 27624consider it more useful to permit linking proprietary applications with the 27625library. If this is what you want to do, use the GNU Lesser General 27626Public License instead of this License. 
27627 27628@node GNU Free Documentation License 27629@unnumbered GNU Free Documentation License 27630 27631@cindex FDL (Free Documentation License) 27632@cindex Free Documentation License (FDL) 27633@cindex GNU Free Documentation License 27634@center Version 1.2, November 2002 27635 27636@display 27637Copyright @copyright{} 2000,2001,2002 Free Software Foundation, Inc. 2763859 Temple Place, Suite 330, Boston, MA 02111-1307, USA 27639 27640Everyone is permitted to copy and distribute verbatim copies 27641of this license document, but changing it is not allowed. 27642@end display 27643 27644@enumerate 0 27645@item 27646PREAMBLE 27647 27648The purpose of this License is to make a manual, textbook, or other 27649functional and useful document @dfn{free} in the sense of freedom: to 27650assure everyone the effective freedom to copy and redistribute it, 27651with or without modifying it, either commercially or noncommercially. 27652Secondarily, this License preserves for the author and publisher a way 27653to get credit for their work, while not being considered responsible 27654for modifications made by others. 27655 27656This License is a kind of ``copyleft'', which means that derivative 27657works of the document must themselves be free in the same sense. It 27658complements the GNU General Public License, which is a copyleft 27659license designed for free software. 27660 27661We have designed this License in order to use it for manuals for free 27662software, because free software needs free documentation: a free 27663program should come with manuals providing the same freedoms that the 27664software does. But this License is not limited to software manuals; 27665it can be used for any textual work, regardless of subject matter or 27666whether it is published as a printed book. We recommend this License 27667principally for works whose purpose is instruction or reference. 
27668 27669@item 27670APPLICABILITY AND DEFINITIONS 27671 27672This License applies to any manual or other work, in any medium, that 27673contains a notice placed by the copyright holder saying it can be 27674distributed under the terms of this License. Such a notice grants a 27675world-wide, royalty-free license, unlimited in duration, to use that 27676work under the conditions stated herein. The ``Document'', below, 27677refers to any such manual or work. Any member of the public is a 27678licensee, and is addressed as ``you''. You accept the license if you 27679copy, modify or distribute the work in a way requiring permission 27680under copyright law. 27681 27682A ``Modified Version'' of the Document means any work containing the 27683Document or a portion of it, either copied verbatim, or with 27684modifications and/or translated into another language. 27685 27686A ``Secondary Section'' is a named appendix or a front-matter section 27687of the Document that deals exclusively with the relationship of the 27688publishers or authors of the Document to the Document's overall 27689subject (or to related matters) and contains nothing that could fall 27690directly within that overall subject. (Thus, if the Document is in 27691part a textbook of mathematics, a Secondary Section may not explain 27692any mathematics.) The relationship could be a matter of historical 27693connection with the subject or with related matters, or of legal, 27694commercial, philosophical, ethical or political position regarding 27695them. 27696 27697The ``Invariant Sections'' are certain Secondary Sections whose titles 27698are designated, as being those of Invariant Sections, in the notice 27699that says that the Document is released under this License. If a 27700section does not fit the above definition of Secondary then it is not 27701allowed to be designated as Invariant. The Document may contain zero 27702Invariant Sections. 
If the Document does not identify any Invariant 27703Sections then there are none. 27704 27705The ``Cover Texts'' are certain short passages of text that are listed, 27706as Front-Cover Texts or Back-Cover Texts, in the notice that says that 27707the Document is released under this License. A Front-Cover Text may 27708be at most 5 words, and a Back-Cover Text may be at most 25 words. 27709 27710A ``Transparent'' copy of the Document means a machine-readable copy, 27711represented in a format whose specification is available to the 27712general public, that is suitable for revising the document 27713straightforwardly with generic text editors or (for images composed of 27714pixels) generic paint programs or (for drawings) some widely available 27715drawing editor, and that is suitable for input to text formatters or 27716for automatic translation to a variety of formats suitable for input 27717to text formatters. A copy made in an otherwise Transparent file 27718format whose markup, or absence of markup, has been arranged to thwart 27719or discourage subsequent modification by readers is not Transparent. 27720An image format is not Transparent if used for any substantial amount 27721of text. A copy that is not ``Transparent'' is called ``Opaque''. 27722 27723Examples of suitable formats for Transparent copies include plain 27724@sc{ascii} without markup, Texinfo input format, La@TeX{} input 27725format, @acronym{SGML} or @acronym{XML} using a publicly available 27726@acronym{DTD}, and standard-conforming simple @acronym{HTML}, 27727PostScript or @acronym{PDF} designed for human modification. Examples 27728of transparent image formats include @acronym{PNG}, @acronym{XCF} and 27729@acronym{JPG}. 
Opaque formats include proprietary formats that can be 27730read and edited only by proprietary word processors, @acronym{SGML} or 27731@acronym{XML} for which the @acronym{DTD} and/or processing tools are 27732not generally available, and the machine-generated @acronym{HTML}, 27733PostScript or @acronym{PDF} produced by some word processors for 27734output purposes only. 27735 27736The ``Title Page'' means, for a printed book, the title page itself, 27737plus such following pages as are needed to hold, legibly, the material 27738this License requires to appear in the title page. For works in 27739formats which do not have any title page as such, ``Title Page'' means 27740the text near the most prominent appearance of the work's title, 27741preceding the beginning of the body of the text. 27742 27743A section ``Entitled XYZ'' means a named subunit of the Document whose 27744title either is precisely XYZ or contains XYZ in parentheses following 27745text that translates XYZ in another language. (Here XYZ stands for a 27746specific section name mentioned below, such as ``Acknowledgements'', 27747``Dedications'', ``Endorsements'', or ``History''.) To ``Preserve the Title'' 27748of such a section when you modify the Document means that it remains a 27749section ``Entitled XYZ'' according to this definition. 27750 27751The Document may include Warranty Disclaimers next to the notice which 27752states that this License applies to the Document. These Warranty 27753Disclaimers are considered to be included by reference in this 27754License, but only as regards disclaiming warranties: any other 27755implication that these Warranty Disclaimers may have is void and has 27756no effect on the meaning of this License. 
27757 27758@item 27759VERBATIM COPYING 27760 27761You may copy and distribute the Document in any medium, either 27762commercially or noncommercially, provided that this License, the 27763copyright notices, and the license notice saying this License applies 27764to the Document are reproduced in all copies, and that you add no other 27765conditions whatsoever to those of this License. You may not use 27766technical measures to obstruct or control the reading or further 27767copying of the copies you make or distribute. However, you may accept 27768compensation in exchange for copies. If you distribute a large enough 27769number of copies you must also follow the conditions in section 3. 27770 27771You may also lend copies, under the same conditions stated above, and 27772you may publicly display copies. 27773 27774@item 27775COPYING IN QUANTITY 27776 27777If you publish printed copies (or copies in media that commonly have 27778printed covers) of the Document, numbering more than 100, and the 27779Document's license notice requires Cover Texts, you must enclose the 27780copies in covers that carry, clearly and legibly, all these Cover 27781Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on 27782the back cover. Both covers must also clearly and legibly identify 27783you as the publisher of these copies. The front cover must present 27784the full title with all words of the title equally prominent and 27785visible. You may add other material on the covers in addition. 27786Copying with changes limited to the covers, as long as they preserve 27787the title of the Document and satisfy these conditions, can be treated 27788as verbatim copying in other respects. 27789 27790If the required texts for either cover are too voluminous to fit 27791legibly, you should put the first ones listed (as many as fit 27792reasonably) on the actual cover, and continue the rest onto adjacent 27793pages. 
27794 27795If you publish or distribute Opaque copies of the Document numbering 27796more than 100, you must either include a machine-readable Transparent 27797copy along with each Opaque copy, or state in or with each Opaque copy 27798a computer-network location from which the general network-using 27799public has access to download using public-standard network protocols 27800a complete Transparent copy of the Document, free of added material. 27801If you use the latter option, you must take reasonably prudent steps, 27802when you begin distribution of Opaque copies in quantity, to ensure 27803that this Transparent copy will remain thus accessible at the stated 27804location until at least one year after the last time you distribute an 27805Opaque copy (directly or through your agents or retailers) of that 27806edition to the public. 27807 27808It is requested, but not required, that you contact the authors of the 27809Document well before redistributing any large number of copies, to give 27810them a chance to provide you with an updated version of the Document. 27811 27812@item 27813MODIFICATIONS 27814 27815You may copy and distribute a Modified Version of the Document under 27816the conditions of sections 2 and 3 above, provided that you release 27817the Modified Version under precisely this License, with the Modified 27818Version filling the role of the Document, thus licensing distribution 27819and modification of the Modified Version to whoever possesses a copy 27820of it. In addition, you must do these things in the Modified Version: 27821 27822@enumerate A 27823@item 27824Use in the Title Page (and on the covers, if any) a title distinct 27825from that of the Document, and from those of previous versions 27826(which should, if there were any, be listed in the History section 27827of the Document). You may use the same title as a previous version 27828if the original publisher of that version gives permission. 
27829 27830@item 27831List on the Title Page, as authors, one or more persons or entities 27832responsible for authorship of the modifications in the Modified 27833Version, together with at least five of the principal authors of the 27834Document (all of its principal authors, if it has fewer than five), 27835unless they release you from this requirement. 27836 27837@item 27838State on the Title page the name of the publisher of the 27839Modified Version, as the publisher. 27840 27841@item 27842Preserve all the copyright notices of the Document. 27843 27844@item 27845Add an appropriate copyright notice for your modifications 27846adjacent to the other copyright notices. 27847 27848@item 27849Include, immediately after the copyright notices, a license notice 27850giving the public permission to use the Modified Version under the 27851terms of this License, in the form shown in the Addendum below. 27852 27853@item 27854Preserve in that license notice the full lists of Invariant Sections 27855and required Cover Texts given in the Document's license notice. 27856 27857@item 27858Include an unaltered copy of this License. 27859 27860@item 27861Preserve the section Entitled ``History'', Preserve its Title, and add 27862to it an item stating at least the title, year, new authors, and 27863publisher of the Modified Version as given on the Title Page. If 27864there is no section Entitled ``History'' in the Document, create one 27865stating the title, year, authors, and publisher of the Document as 27866given on its Title Page, then add an item describing the Modified 27867Version as stated in the previous sentence. 27868 27869@item 27870Preserve the network location, if any, given in the Document for 27871public access to a Transparent copy of the Document, and likewise 27872the network locations given in the Document for previous versions 27873it was based on. These may be placed in the ``History'' section. 
27874You may omit a network location for a work that was published at 27875least four years before the Document itself, or if the original 27876publisher of the version it refers to gives permission. 27877 27878@item 27879For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve 27880the Title of the section, and preserve in the section all the 27881substance and tone of each of the contributor acknowledgements and/or 27882dedications given therein. 27883 27884@item 27885Preserve all the Invariant Sections of the Document, 27886unaltered in their text and in their titles. Section numbers 27887or the equivalent are not considered part of the section titles. 27888 27889@item 27890Delete any section Entitled ``Endorsements''. Such a section 27891may not be included in the Modified Version. 27892 27893@item 27894Do not retitle any existing section to be Entitled ``Endorsements'' or 27895to conflict in title with any Invariant Section. 27896 27897@item 27898Preserve any Warranty Disclaimers. 27899@end enumerate 27900 27901If the Modified Version includes new front-matter sections or 27902appendices that qualify as Secondary Sections and contain no material 27903copied from the Document, you may at your option designate some or all 27904of these sections as invariant. To do this, add their titles to the 27905list of Invariant Sections in the Modified Version's license notice. 27906These titles must be distinct from any other section titles. 27907 27908You may add a section Entitled ``Endorsements'', provided it contains 27909nothing but endorsements of your Modified Version by various 27910parties---for example, statements of peer review or that the text has 27911been approved by an organization as the authoritative definition of a 27912standard. 27913 27914You may add a passage of up to five words as a Front-Cover Text, and a 27915passage of up to 25 words as a Back-Cover Text, to the end of the list 27916of Cover Texts in the Modified Version. 
Only one passage of 27917Front-Cover Text and one of Back-Cover Text may be added by (or 27918through arrangements made by) any one entity. If the Document already 27919includes a cover text for the same cover, previously added by you or 27920by arrangement made by the same entity you are acting on behalf of, 27921you may not add another; but you may replace the old one, on explicit 27922permission from the previous publisher that added the old one. 27923 27924The author(s) and publisher(s) of the Document do not by this License 27925give permission to use their names for publicity for or to assert or 27926imply endorsement of any Modified Version. 27927 27928@item 27929COMBINING DOCUMENTS 27930 27931You may combine the Document with other documents released under this 27932License, under the terms defined in section 4 above for modified 27933versions, provided that you include in the combination all of the 27934Invariant Sections of all of the original documents, unmodified, and 27935list them all as Invariant Sections of your combined work in its 27936license notice, and that you preserve all their Warranty Disclaimers. 27937 27938The combined work need only contain one copy of this License, and 27939multiple identical Invariant Sections may be replaced with a single 27940copy. If there are multiple Invariant Sections with the same name but 27941different contents, make the title of each such section unique by 27942adding at the end of it, in parentheses, the name of the original 27943author or publisher of that section if known, or else a unique number. 27944Make the same adjustment to the section titles in the list of 27945Invariant Sections in the license notice of the combined work. 27946 27947In the combination, you must combine any sections Entitled ``History'' 27948in the various original documents, forming one section Entitled 27949``History''; likewise combine any sections Entitled ``Acknowledgements'', 27950and any sections Entitled ``Dedications''. 
You must delete all 27951sections Entitled ``Endorsements.'' 27952 27953@item 27954COLLECTIONS OF DOCUMENTS 27955 27956You may make a collection consisting of the Document and other documents 27957released under this License, and replace the individual copies of this 27958License in the various documents with a single copy that is included in 27959the collection, provided that you follow the rules of this License for 27960verbatim copying of each of the documents in all other respects. 27961 27962You may extract a single document from such a collection, and distribute 27963it individually under this License, provided you insert a copy of this 27964License into the extracted document, and follow this License in all 27965other respects regarding verbatim copying of that document. 27966 27967@item 27968AGGREGATION WITH INDEPENDENT WORKS 27969 27970A compilation of the Document or its derivatives with other separate 27971and independent documents or works, in or on a volume of a storage or 27972distribution medium, is called an ``aggregate'' if the copyright 27973resulting from the compilation is not used to limit the legal rights 27974of the compilation's users beyond what the individual works permit. 27975When the Document is included in an aggregate, this License does not 27976apply to the other works in the aggregate which are not themselves 27977derivative works of the Document. 27978 27979If the Cover Text requirement of section 3 is applicable to these 27980copies of the Document, then if the Document is less than one half of 27981the entire aggregate, the Document's Cover Texts may be placed on 27982covers that bracket the Document within the aggregate, or the 27983electronic equivalent of covers if the Document is in electronic form. 27984Otherwise they must appear on printed covers that bracket the whole 27985aggregate. 
27986 27987@item 27988TRANSLATION 27989 27990Translation is considered a kind of modification, so you may 27991distribute translations of the Document under the terms of section 4. 27992Replacing Invariant Sections with translations requires special 27993permission from their copyright holders, but you may include 27994translations of some or all Invariant Sections in addition to the 27995original versions of these Invariant Sections. You may include a 27996translation of this License, and all the license notices in the 27997Document, and any Warranty Disclaimers, provided that you also include 27998the original English version of this License and the original versions 27999of those notices and disclaimers. In case of a disagreement between 28000the translation and the original version of this License or a notice 28001or disclaimer, the original version will prevail. 28002 28003If a section in the Document is Entitled ``Acknowledgements'', 28004``Dedications'', or ``History'', the requirement (section 4) to Preserve 28005its Title (section 1) will typically require changing the actual 28006title. 28007 28008@item 28009TERMINATION 28010 28011You may not copy, modify, sublicense, or distribute the Document except 28012as expressly provided for under this License. Any other attempt to 28013copy, modify, sublicense or distribute the Document is void, and will 28014automatically terminate your rights under this License. However, 28015parties who have received copies, or rights, from you under this 28016License will not have their licenses terminated so long as such 28017parties remain in full compliance. 28018 28019@item 28020FUTURE REVISIONS OF THIS LICENSE 28021 28022The Free Software Foundation may publish new, revised versions 28023of the GNU Free Documentation License from time to time. Such new 28024versions will be similar in spirit to the present version, but may 28025differ in detail to address new problems or concerns. 
See 28026@uref{http://www.gnu.org/copyleft/}. 28027 28028Each version of the License is given a distinguishing version number. 28029If the Document specifies that a particular numbered version of this 28030License ``or any later version'' applies to it, you have the option of 28031following the terms and conditions either of that specified version or 28032of any later version that has been published (not as a draft) by the 28033Free Software Foundation. If the Document does not specify a version 28034number of this License, you may choose any version ever published (not 28035as a draft) by the Free Software Foundation. 28036@end enumerate 28037 28038@c fakenode --- for prepinfo 28039@unnumberedsec ADDENDUM: How to use this License for your documents 28040 28041To use this License in a document you have written, include a copy of 28042the License in the document and put the following copyright and 28043license notices just after the title page: 28044 28045@smallexample 28046@group 28047 Copyright (C) @var{year} @var{your name}. 28048 Permission is granted to copy, distribute and/or modify this document 28049 under the terms of the GNU Free Documentation License, Version 1.2 28050 or any later version published by the Free Software Foundation; 28051 with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. 28052 A copy of the license is included in the section entitled ``GNU 28053 Free Documentation License''. 28054@end group 28055@end smallexample 28056 28057If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, 28058replace the ``with...Texts.'' line with this: 28059 28060@smallexample 28061@group 28062 with the Invariant Sections being @var{list their titles}, with 28063 the Front-Cover Texts being @var{list}, and with the Back-Cover Texts 28064 being @var{list}. 
28065@end group 28066@end smallexample 28067 28068If you have Invariant Sections without Cover Texts, or some other 28069combination of the three, merge those two alternatives to suit the 28070situation. 28071 28072If your document contains nontrivial examples of program code, we 28073recommend releasing these examples in parallel under your choice of 28074free software license, such as the GNU General Public License, 28075to permit their use in free software. 28076 28077@c Local Variables: 28078@c ispell-local-pdict: "ispell-dict" 28079@c End: 28080 28081 28082@node Index 28083@unnumbered Index 28084@printindex cp 28085 28086@bye 28087 28088Unresolved Issues: 28089------------------ 280901. From ADR. 28091 28092 Robert J. Chassell points out that awk programs should have some indication 28093 of how to use them. It would be useful to perhaps have a "programming 28094 style" section of the manual that would include this and other tips. 28095 280962. The default AWKPATH search path should be configurable via `configure' 28097 The default and how this changes needs to be documented. 28098 28099Consistency issues: 28100 /.../ regexps are in @code, not @samp 28101 ".." strings are in @code, not @samp 28102 no @print before @dots 28103 values of expressions in the text (@code{x} has the value 15), 28104 should be in roman, not @code 28105 Use TAB and not tab 28106 Use ESC and not ESCAPE 28107 Use space and not blank to describe the space bar's character 28108 The term "blank" is thus basically reserved for "blank lines" etc. 28109 To make dark corners work, the @value{DARKCORNER} has to be outside 28110 closing `.' of a sentence and after (pxref{...}). This is 28111 a change from earlier versions. 
28112 " " should have an @w{} around it 28113 Use "non-" only with language names or acronyms, or the words bug and option 28114 Use @command{ftp} when talking about anonymous ftp 28115 Use uppercase and lowercase, not "upper-case" and "lower-case" 28116 or "upper case" and "lower case" 28117 Use "single precision" and "double precision", not "single-precision" or "double-precision" 28118 Use alphanumeric, not alpha-numeric 28119 Use POSIX-compliant, not POSIX compliant 28120 Use --foo, not -Wfoo when describing long options 28121 Use "Bell Laboratories", but not "Bell Labs". 28122 Use "behavior" instead of "behaviour". 28123 Use "zeros" instead of "zeroes". 28124 Use "nonzero" not "non-zero". 28125 Use "runtime" not "run time" or "run-time". 28126 Use "command-line" not "command line". 28127 Use "online" not "on-line". 28128 Use "whitespace" not "white space". 28129 Use "Input/Output", not "input/output". Also "I/O", not "i/o". 28130 Use "lefthand"/"righthand", not "left-hand"/"right-hand". 28131 Use "workaround", not "work-around". 28132 Use "startup"/"cleanup", not "start-up"/"clean-up" 28133 Use @code{do}, and not @code{do}-@code{while}, except where 28134 actually discussing the do-while. 28135 Use "versus" in text and "vs." in index entries 28136 The words "a", "and", "as", "between", "for", "from", "in", "of", 28137 "on", "that", "the", "to", "with", and "without", 28138 should not be capitalized in @chapter, @section etc. 28139 "Into" and "How" should. 28140 Search for @dfn; make sure important items are also indexed. 28141 "e.g." should always be followed by a comma. 28142 "i.e." should always be followed by a comma. 28143 The numbers zero through ten should be spelled out, except when 28144 talking about file descriptor numbers. > 10 and < 0, it's 28145 ok to use numbers. 28146 In tables, put command-line options in @code, while in the text, 28147 put them in @option. 
28148 When using @strong, use "Note:" or "Caution:" with colons and 28149 not exclamation points. Do not surround the paragraphs 28150 with @quotation ... @end quotation. 28151 For most cases, do NOT put a comma before "and", "or" or "but". 28152 But exercise taste with this rule. 28153 Don't show the awk command with a program in quotes when it's 28154 just the program. I.e. 28155 28156 { 28157 .... 28158 } 28159 28160 not 28161 awk '{ 28162 ... 28163 }' 28164 28165 Do show it when showing command-line arguments, data files, etc, even 28166 if there is no output shown. 28167 28168 Use numbered lists only to show a sequential series of steps. 28169 28170 Use @code{xxx} for the xxx operator in indexing statements, not @samp. 28171 28172Date: Wed, 13 Apr 94 15:20:52 -0400 28173From: rms@gnu.org (Richard Stallman) 28174To: gnu-prog@gnu.org 28175Subject: A reminder: no pathnames in GNU 28176 28177It's a GNU convention to use the term "file name" for the name of a 28178file, never "pathname". We use the term "path" for search paths, 28179which are lists of file names. Using it for a single file name as 28180well is potentially confusing to users. 28181 28182So please check any documentation you maintain, if you think you might 28183have used "pathname". 28184 28185Note that "file name" should be two words when it appears as ordinary 28186text. It's ok as one word when it's a metasyntactic variable, though. 28187 28188------------------------ 28189ORA uses filename, thus the macro. 28190 28191Suggestions: 28192------------ 28193Enhance FIELDWIDTHS with some way to indicate "the rest of the record". 28194E.g., a length of 0 or -1 or something. May be "n"? 28195 28196Make FIELDWIDTHS be an array? 28197 28198% Next edition: 28199% 1. Talk about common extensions, those in nawk, gawk, mawk 28200% 2. Use @code{foo} for variables and @code{foo()} for functions 28201% 3. Standardize the error messages from the functions and programs 28202% in Chapters 12 and 13. 28203% 4. 
Nuke the BBS stuff and use something that won't be obsolete 28204% 5. Reorg chapters 5 & 7 like so: 28205%Chapter 5: 28206% - Constants, Variables, and Conversions 28207% + Constant Expressions 28208% + Using Regular Expression Constants 28209% + Variables 28210% + Conversion of Strings and Numbers 28211% - Operators 28212% + Arithmetic Operators 28213% + String Concatenation 28214% + Assignment Expressions 28215% + Increment and Decrement Operators 28216% - Truth Values and Conditions 28217% + True and False in Awk 28218% + Boolean Expressions 28219% + Conditional Expressions 28220% - Function Calls 28221% - Operator Precedence 28222% 28223%Chapter 7: 28224% - Array Basics 28225% + Introduction to Arrays 28226% + Referring to an Array Element 28227% + Assigning Array Elements 28228% + Basic Array Example 28229% + Scanning All Elements of an Array 28230% - The delete Statement 28231% - Using Numbers to Subscript Arrays 28232% - Using Uninitialized Variables as Subscripts 28233% - Multidimensional Arrays 28234% + Scanning Multidimensional Arrays 28235% - Sorting Array Values and Indices with gawk 28236