.\"	$NetBSD: magic.5,v 1.15 2017/02/10 17:53:24 christos Exp $
.\"
.\" $File: magic.man,v 1.90 2017/02/08 21:52:03 christos Exp $
.Dd February 8, 2017
.Dt MAGIC 5
.Os
.\" install as magic.4 on USG, magic.5 on V7, Berkeley and Linux systems.
.Sh NAME
.Nm magic
.Nd file command's magic pattern file
.Sh DESCRIPTION
This manual page documents the format of magic files as
used by the
.Xr file 1
command, version 5.30.
The
.Xr file 1
command identifies the type of a file using,
among other tests,
a test for whether the file contains certain
.Dq "magic patterns" .
The database of these
.Dq "magic patterns"
is usually located in a binary file in
.Pa /usr/share/misc/magic.mgc
or a directory of source text magic pattern fragment files in
.Pa /usr/share/misc/magic .
The database specifies what patterns are to be tested for, what message or
MIME type to print if a particular pattern is found,
and additional information to extract from the file.
.Pp
The format of the source fragment files that are used to build this database
is as follows:
Each line of a fragment file specifies a test to be performed.
A test compares the data starting at a particular offset
in the file with a byte value, a string or a numeric value.
If the test succeeds, a message is printed.
The line consists of the following fields:
.Bl -tag -width ".Dv message"
.It Dv offset
A number specifying the offset, in bytes, into the file of the data
which is to be tested.
.It Dv type
The type of the data to be tested.
The possible values are:
.Bl -tag -width ".Dv lestring16"
.It Dv byte
A one-byte value.
.It Dv short
A two-byte value in this machine's native byte order.
.It Dv long
A four-byte value in this machine's native byte order.
.It Dv quad
An eight-byte value in this machine's native byte order.
.It Dv float
A 32-bit single precision IEEE floating point number in this machine's native byte order.
.It Dv double
A 64-bit double precision IEEE floating point number in this machine's native byte order.
.It Dv string
A string of bytes.
The string type specification can be optionally followed
by /[WwcCtbT]*.
The
.Dq W
flag compacts whitespace in the target, which must
contain at least one whitespace character.
If the magic has
.Dv n
consecutive blanks, the target needs at least
.Dv n
consecutive blanks to match.
The
.Dq w
flag treats every blank in the magic as an optional blank.
The
.Dq c
flag specifies case insensitive matching: lower case
characters in the magic match both lower and upper case characters in the
target, whereas upper case characters in the magic only match upper case
characters in the target.
The
.Dq C
flag specifies case insensitive matching: upper case
characters in the magic match both lower and upper case characters in the
target, whereas lower case characters in the magic only match lower case
characters in the target.
To do a complete case insensitive match, specify both
.Dq c
and
.Dq C .
The
.Dq t
flag forces the test to be done for text files, while the
.Dq b
flag forces the test to be done for binary files.
The
.Dq T
flag causes the string to be trimmed, i.e. leading and trailing whitespace
is deleted before the string is printed.
.It Dv pstring
A Pascal-style string where the first byte/short/int is interpreted as the
unsigned length.
The length defaults to byte and can be specified as a modifier.
The following modifiers are supported:
.Bl -tag -compact -width B
.It B
A byte length (default).
.It H
A 2 byte big endian length.
.It h
A 2 byte little endian length.
.It L
A 4 byte big endian length.
.It l
A 4 byte little endian length.
.It J
The length includes itself in its count.
.El
The string is not NUL terminated.
.Dq J
is used rather than the more
valuable
.Dq I
because this type of length is a feature of the JPEG
format.
.It Dv date
A four-byte value interpreted as a UNIX date.
.It Dv qdate
An eight-byte value interpreted as a UNIX date.
.It Dv ldate
A four-byte value interpreted as a UNIX-style date, but interpreted as
local time rather than UTC.
.It Dv qldate
An eight-byte value interpreted as a UNIX-style date, but interpreted as
local time rather than UTC.
.It Dv qwdate
An eight-byte value interpreted as a Windows-style date.
.It Dv beid3
A 32-bit ID3 length in big-endian byte order.
.It Dv beshort
A two-byte value in big-endian byte order.
.It Dv belong
A four-byte value in big-endian byte order.
.It Dv bequad
An eight-byte value in big-endian byte order.
.It Dv befloat
A 32-bit single precision IEEE floating point number in big-endian byte order.
.It Dv bedouble
A 64-bit double precision IEEE floating point number in big-endian byte order.
.It Dv bedate
A four-byte value in big-endian byte order,
interpreted as a Unix date.
.It Dv beqdate
An eight-byte value in big-endian byte order,
interpreted as a Unix date.
.It Dv beldate
A four-byte value in big-endian byte order,
interpreted as a UNIX-style date, but interpreted as local time rather
than UTC.
.It Dv beqldate
An eight-byte value in big-endian byte order,
interpreted as a UNIX-style date, but interpreted as local time rather
than UTC.
.It Dv beqwdate
An eight-byte value in big-endian byte order,
interpreted as a Windows-style date.
.It Dv bestring16
A two-byte unicode (UCS16) string in big-endian byte order.
.It Dv leid3
A 32-bit ID3 length in little-endian byte order.
.It Dv leshort
A two-byte value in little-endian byte order.
.It Dv lelong
A four-byte value in little-endian byte order.
.It Dv lequad
An eight-byte value in little-endian byte order.
.It Dv lefloat
A 32-bit single precision IEEE floating point number in little-endian byte order.
.It Dv ledouble
A 64-bit double precision IEEE floating point number in little-endian byte order.
.It Dv ledate
A four-byte value in little-endian byte order,
interpreted as a UNIX date.
.It Dv leqdate
An eight-byte value in little-endian byte order,
interpreted as a UNIX date.
.It Dv leldate
A four-byte value in little-endian byte order,
interpreted as a UNIX-style date, but interpreted as local time rather
than UTC.
.It Dv leqldate
An eight-byte value in little-endian byte order,
interpreted as a UNIX-style date, but interpreted as local time rather
than UTC.
.It Dv leqwdate
An eight-byte value in little-endian byte order,
interpreted as a Windows-style date.
.It Dv lestring16
A two-byte unicode (UCS16) string in little-endian byte order.
.It Dv melong
A four-byte value in middle-endian (PDP-11) byte order.
.It Dv medate
A four-byte value in middle-endian (PDP-11) byte order,
interpreted as a UNIX date.
.It Dv meldate
A four-byte value in middle-endian (PDP-11) byte order,
interpreted as a UNIX-style date, but interpreted as local time rather
than UTC.
.It Dv indirect
Starting at the given offset, consult the magic database again.
The offset of the
.Dv indirect
magic is by default absolute in the file, but one can specify
.Dv /r
to indicate that the offset is relative from the beginning of the entry.
.It Dv name
Define a
.Dq named
magic instance that can be called from another
.Dv use
magic entry, like a subroutine call.
Named instance direct magic offsets are relative to the offset of the
previous matched entry, but indirect offsets are relative to the beginning
of the file as usual.
Named magic entries always match.
.It Dv use
Recursively call the named magic starting from the current offset.
If the name of the referenced magic begins with a
.Dv ^
then the endianness of the magic is switched; if the magic mentioned
.Dv leshort
for example,
it is treated as
.Dv beshort
and vice versa.
This is useful to avoid duplicating the rules for different endianness.
.It Dv regex
A regular expression match in extended POSIX regular expression syntax
(like egrep).
Regular expressions can take exponential time to process, and their
performance is hard to predict, so their use is discouraged.
When used in production environments, their performance
should be carefully checked.
The size of the string to search should also be limited by specifying
.Dv /<length> ,
to avoid performance issues scanning long files.
The type specification can also be optionally followed by
.Dv /[c][s][l] .
The
.Dq c
flag makes the match case insensitive, while the
.Dq s
flag updates the offset to the start offset of the match, rather than the end.
The
.Dq l
modifier changes the limit of length to mean number of lines instead of a
byte count.
Lines are delimited by the platform's native line delimiter.
When a line count is specified, an implicit byte count is also computed assuming
each line is 80 characters long.
If neither a byte nor a line count is specified, the search is limited automatically
to 8KiB.
.Dv ^
and
.Dv $
match the beginning and end of individual lines, respectively,
not beginning and end of file.
.It Dv search
A literal string search starting at the given offset.
The same modifier flags can be used as for string patterns.
The search expression must contain the range in the form
.Dv /number,
that is the number of positions at which the match will be
attempted, starting from the start offset.
This is suitable for
searching larger binary expressions with variable offsets, using
.Dv \e
escapes for special characters.
The order of modifier and number is not relevant.
.It Dv default
This is intended to be used with the test
.Em x
(which is always true) and it has no type.
It matches when no other test at that continuation level has matched before.
Clearing the matched tests for a continuation level can be done using the
.Dv clear
test.
.It Dv clear
This test is always true and clears the match flag for that continuation level.
It is intended to be used with the
.Dv default
test.
.El
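.Pp
As an illustration only, the
.Dv pstring
length modifiers described above can be sketched in Python.
The function
.Fn read_pstring
and its arguments are hypothetical and are not part of
.Xr file 1 ;
the sketch assumes that a
.Dq J
length counts the bytes of the length field itself.
.Bd -literal -offset indent
```python
# Hypothetical helper, not taken from file(1): decode a Pascal-style string.
# Maps each length modifier to (size of the length field, byte order).
LENGTH_FIELDS = {
    "B": (1, "big"),
    "H": (2, "big"),
    "h": (2, "little"),
    "L": (4, "big"),
    "l": (4, "little"),
}


def read_pstring(data, offset, modifier="B", includes_self=False):
    """Return the string bytes of a pstring stored at offset in data."""
    size, order = LENGTH_FIELDS[modifier]
    length = int.from_bytes(data[offset:offset + size], order)
    if includes_self:
        # With the J modifier the stored length also counts the length field.
        length = max(length - size, 0)
    start = offset + size
    return data[start:start + length]
```
.Ed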
.Pp
For compatibility with the Single
.Ux
Standard, the type specifiers
.Dv dC
and
.Dv d1
are equivalent to
.Dv byte ,
the type specifiers
.Dv uC
and
.Dv u1
are equivalent to
.Dv ubyte ,
the type specifiers
.Dv dS
and
.Dv d2
are equivalent to
.Dv short ,
the type specifiers
.Dv uS
and
.Dv u2
are equivalent to
.Dv ushort ,
the type specifiers
.Dv dI ,
.Dv dL ,
and
.Dv d4
are equivalent to
.Dv long ,
the type specifiers
.Dv uI ,
.Dv uL ,
and
.Dv u4
are equivalent to
.Dv ulong ,
the type specifier
.Dv d8
is equivalent to
.Dv quad ,
the type specifier
.Dv u8
is equivalent to
.Dv uquad ,
and the type specifier
.Dv s
is equivalent to
.Dv string .
In addition, the type specifier
.Dv dQ
is equivalent to
.Dv quad
and the type specifier
.Dv uQ
is equivalent to
.Dv uquad .
.Pp
Each top-level magic pattern (see below for an explanation of levels)
is classified as text or binary according to the types used.
Types
.Dq regex
and
.Dq search
are classified as text tests, unless non-printable characters are used
in the pattern.
All other tests are classified as binary.
A top-level
pattern is considered to be a text pattern when all its patterns are text
patterns; otherwise, it is considered to be a binary pattern.
When
matching a file, binary patterns are tried first; if no match is
found, and the file looks like text, then its encoding is determined
and the text patterns are tried.
.Pp
The numeric types may optionally be followed by
.Dv \*[Am]
and a numeric value,
to specify that the value is to be AND'ed with the
numeric value before any comparisons are done.
Prepending a
.Dv u
to the type indicates that ordered comparisons should be unsigned.
.It Dv test
The value to be compared with the value from the file.
If the type is
numeric, this value
is specified in C form; if it is a string, it is specified as a C string
with the usual escapes permitted (e.g. \en for new-line).
.Pp
Numeric values
may be preceded by a character indicating the operation to be performed.
It may be
.Dv = ,
to specify that the value from the file must equal the specified value,
.Dv \*[Lt] ,
to specify that the value from the file must be less than the specified
value,
.Dv \*[Gt] ,
to specify that the value from the file must be greater than the specified
value,
.Dv \*[Am] ,
to specify that the value from the file must have set all of the bits
that are set in the specified value,
.Dv ^ ,
to specify that the value from the file must have at least one of the bits
that are set in the specified value clear,
.Dv ~ ,
to specify that the value given after it is negated before the test is performed, or
.Dv x ,
to specify that any value will match.
If the character is omitted, it is assumed to be
.Dv = .
Operators
.Dv \*[Am] ,
.Dv ^ ,
and
.Dv ~
don't work with floats and doubles.
The operator
.Dv !\&
specifies that the line matches if the test does
.Em not
succeed.
.Pp
Numeric values are specified in C form; e.g.
.Dv 13
is decimal,
.Dv 013
is octal, and
.Dv 0x13
is hexadecimal.
.Pp
Numeric operations are not performed on date types; instead the numeric
value is interpreted as an offset.
.Pp
For string values, the string from the
file must match the specified string.
The operators
.Dv = ,
.Dv \*[Lt]
and
.Dv \*[Gt]
(but not
.Dv \*[Am] )
can be applied to strings.
The length used for matching is that of the string argument
in the magic file.
This means that a line can match any non-empty string (usually used to
then print the string), with
.Em \*[Gt]\e0
(because all non-empty strings are greater than the empty string).
.Pp
Dates are treated as numerical values in the respective internal
representation.
.Pp
The special test
.Em x
always evaluates to true.
.It Dv message
The message to be printed if the comparison succeeds.
If the string contains a
.Xr printf 3
format specification, the value from the file (with any specified masking
performed) is printed using the message as the format string.
If the string begins with
.Dq \eb ,
the message printed is the remainder of the string with no whitespace
added before it: multiple matches are normally separated by a single
space.
.El
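.Pp
As a rough illustration, the numeric operators listed for the
.Dv test
field can be modelled in Python.
The function
.Fn numeric_test
is hypothetical and is not part of
.Xr file 1 ;
it assumes unsigned values of a fixed width with any type mask already
applied, treats
.Dv !\&
as a failed equality test, and treats
.Dv ~
as a bitwise complement within that width.
.Bd -literal -offset indent
```python
# Hypothetical sketch of the numeric comparison operators; not file(1) code.
def numeric_test(value, op, operand, width=4):
    """Return True if value (read from the file) passes the test.

    value and operand are assumed to be unsigned integers that fit in
    width bytes.
    """
    all_ones = (1 << (8 * width)) - 1
    if op == "x":
        return True                              # any value matches
    if op == "=":
        return value == operand
    if op == "!":
        return value != operand                  # the equality test must fail
    if op == "<":
        return value < operand
    if op == ">":
        return value > operand
    if op == "&":
        return (value & operand) == operand      # all operand bits are set
    if op == "^":
        return (value & operand) != operand      # some operand bit is clear
    if op == "~":
        return value == (~operand & all_ones)    # compare with negated operand
    raise ValueError("unknown operator: " + op)
```
.Ed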
.Pp
An Apple 4+4 character creator and type can be specified as:
.Bd -literal -offset indent
!:apple	CREATYPE
.Ed
.Pp
A MIME type is given on a separate line, which must be the next
non-blank, non-comment line after the magic line that identifies the
file type, and has the following format:
.Bd -literal -offset indent
!:mime	MIMETYPE
.Ed
.Pp
i.e. the literal string
.Dq !:mime
followed by the MIME type.
.Pp
An optional strength can be supplied on a separate line which refers to
the current magic description using the following format:
.Bd -literal -offset indent
!:strength OP VALUE
.Ed
.Pp
The operator
.Dv OP
can be:
.Dv + ,
.Dv - ,
.Dv * ,
or
.Dv /
and
.Dv VALUE
is a constant between 0 and 255.
This constant is applied using the specified operator
to the currently computed default magic strength.
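.Pp
The arithmetic behind a strength line can be sketched as follows.
This is a hedged illustration, not the code used by
.Xr file 1 :
the name
.Fn adjust_strength
is hypothetical, the default strength is passed in as a plain integer, and
.Dv /
is assumed to be integer division.
.Bd -literal -offset indent
```python
# Hypothetical sketch: apply a "!:strength OP VALUE" line to a default strength.
import operator

STRENGTH_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.floordiv,   # assumption: integer division
}


def adjust_strength(default_strength, op, value):
    """Return the strength after applying op and value to the default."""
    if not 0 <= value <= 255:
        raise ValueError("VALUE must be between 0 and 255")
    return STRENGTH_OPS[op](default_strength, value)
```
.Ed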
505.Pp
506Some file formats contain additional information which is to be printed
507along with the file type or need additional tests to determine the true
508file type.
509These additional tests are introduced by one or more
510.Em \*[Gt]
511characters preceding the offset.
512The number of
513.Em \*[Gt]
514on the line indicates the level of the test; a line with no
515.Em \*[Gt]
516at the beginning is considered to be at level 0.
517Tests are arranged in a tree-like hierarchy:
518if the test on a line at level
519.Em n
520succeeds, all following tests at level
521.Em n+1
522are performed, and the messages printed if the tests succeed, until a line
523with level
524.Em n
525(or less) appears.
526For more complex files, one can use empty messages to get just the
527"if/then" effect, in the following way:
528.Bd -literal -offset indent
5290      string   MZ
530\*[Gt]0x18  leshort  \*[Lt]0x40   MS-DOS executable
531\*[Gt]0x18  leshort  \*[Gt]0x3f   extended PC executable (e.g., MS Windows)
532.Ed
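.Pp
The way continuation levels gate later tests can be sketched in Python.
This is only a model of the rule stated above: the function
.Fn run_entry
is hypothetical, it handles a single top-level entry, and each test result
is supplied as a ready-made boolean instead of being computed from a file.
.Bd -literal -offset indent
```python
# Hypothetical sketch of how continuation levels gate later tests.
def run_entry(lines):
    """Collect the messages printed for one top-level entry.

    lines is a list of (level, succeeded, message) tuples in file order;
    succeeded stands in for the result of actually running the test.
    """
    messages = []
    depth = -1                      # deepest level on the matched path
    for level, succeeded, message in lines:
        if level > depth + 1:
            continue                # the parent test did not succeed
        if succeeded:
            depth = level
            if message:
                messages.append(message)
        else:
            depth = level - 1       # skip the children of this line
    return messages
```
.Ed
.Pp
For the MZ example above, a file whose value at 0x18 is 0x40 would be
modelled as a successful level 0 line, a failed first level 1 line and a
successful second level 1 line, so only the second message is collected.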
.Pp
Offsets do not need to be constant, but can also be read from the file
being examined.
If the first character following the last
.Em \*[Gt]
is a
.Em \&(
then the string after the parenthesis is interpreted as an indirect offset.
That means that the number after the parenthesis is used as an offset in
the file.
The value at that offset is read, and is used again as an offset
in the file.
Indirect offsets are of the form:
.Em (( x [[.,][bislBISLm]][+\-][ y ]) .
The value of
.Em x
is used as an offset in the file.
A byte, id3 length, short or long is read at that offset depending on the
.Em [bislBISLm]
type specifier.
The value is treated as signed if
.Dq ,
is specified or unsigned if
.Dq .
is specified.
The capitalized types interpret the number as a big endian
value, whereas the small letter versions interpret the number as a little
endian value;
the
.Em m
type interprets the number as a middle endian (PDP-11) value.
To that number the value of
.Em y
is added and the result is used as an offset in the file.
The default type if one is not specified is long.
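.Pp
A minimal sketch of resolving such an indirect offset, under stated
assumptions: the function
.Fn resolve_indirect
is hypothetical, only the
.Em b ,
.Em s
and
.Em l
letters (in either case) are handled, and a missing type is assumed to
mean a little-endian long.
.Bd -literal -offset indent
```python
# Hypothetical sketch of resolving an indirect offset such as (0x3c.l);
# only the b, s and l type letters are handled here.
SIZES = {"b": 1, "s": 2, "l": 4}


def resolve_indirect(data, x, type_letter="l", signed=False, y=0):
    """Read a value at offset x in data and return that value plus y."""
    size = SIZES[type_letter.lower()]
    # Lower-case letters are little-endian, capital letters big-endian.
    order = "little" if type_letter.islower() else "big"
    value = int.from_bytes(data[x:x + size], order, signed=signed)
    return value + y
```
.Ed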
.Pp
That way variable length structures can be examined:
.Bd -literal -offset indent
# MS Windows executables are also valid MS-DOS executables
0           string  MZ
\*[Gt]0x18       leshort \*[Lt]0x40   MZ executable (MS-DOS)
# skip the whole block below if it is not an extended executable
\*[Gt]0x18       leshort \*[Gt]0x3f
\*[Gt]\*[Gt](0x3c.l)  string  PE\e0\e0  PE executable (MS-Windows)
\*[Gt]\*[Gt](0x3c.l)  string  LX\e0\e0  LX executable (OS/2)
.Ed
.Pp
This strategy of examining has a drawback: you must make sure that you
eventually print something, or users may get empty output (such as when
there is neither PE\e0\e0 nor LX\e0\e0 in the above example).
.Pp
If this indirect offset cannot be used directly, simple calculations are
possible: appending
.Em [+-*/%\*[Am]|^]number
inside parentheses allows one to modify
the value read from the file before it is used as an offset:
.Bd -literal -offset indent
# MS Windows executables are also valid MS-DOS executables
0           string  MZ
# sometimes, the value at 0x18 is less than 0x40 but there's still an
# extended executable, simply appended to the file
\*[Gt]0x18       leshort \*[Lt]0x40
\*[Gt]\*[Gt](4.s*512) leshort 0x014c  COFF executable (MS-DOS, DJGPP)
\*[Gt]\*[Gt](4.s*512) leshort !0x014c MZ executable (MS-DOS)
.Ed
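.Pp
The calculation in an expression such as (4.s*512) can be sketched as
follows.
The names
.Fn indirect_with_op
and
.Dv OFFSET_OPS
are hypothetical, the value is always read as an unsigned little-endian
short, and
.Dv /
is assumed to be integer division.
.Bd -literal -offset indent
```python
# Hypothetical sketch of the arithmetic allowed inside the parentheses,
# for example (4.s*512).
import operator

OFFSET_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.floordiv,   # assumption: integer division
    "%": operator.mod,
    "&": operator.and_,
    "|": operator.or_,
    "^": operator.xor,
}


def indirect_with_op(data, x, op, number):
    """Read a little-endian short at x and combine it with number."""
    value = int.from_bytes(data[x:x + 2], "little")
    return OFFSET_OPS[op](value, number)
```
.Ed
.Pp
With a short of 3 stored at offset 4, the expression (4.s*512) would yield
offset 1536 in this model.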
598.Pp
599Sometimes you do not know the exact offset as this depends on the length or
600position (when indirection was used before) of preceding fields.
601You can specify an offset relative to the end of the last up-level
602field using
603.Sq \*[Am]
604as a prefix to the offset:
605.Bd -literal -offset indent
6060           string  MZ
607\*[Gt]0x18       leshort \*[Gt]0x3f
608\*[Gt]\*[Gt](0x3c.l)  string  PE\e0\e0    PE executable (MS-Windows)
609# immediately following the PE signature is the CPU type
610\*[Gt]\*[Gt]\*[Gt]\*[Am]0       leshort 0x14c     for Intel 80386
611\*[Gt]\*[Gt]\*[Gt]\*[Am]0       leshort 0x184     for DEC Alpha
612.Ed
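.Pp
The relative offset in the example above can be modelled in Python.
Both functions are hypothetical illustrations and are not part of
.Xr file 1 ;
the sketch assumes that the end of the up-level match is its start offset
plus the length of the matched string.
.Bd -literal -offset indent
```python
# Hypothetical sketch: an offset written with a leading "&" is taken
# relative to the end of the previous higher-level match.
PE_SIGNATURE = bytes([0x50, 0x45, 0x00, 0x00])   # "PE" and two NUL bytes


def relative_offset(match_start, match_length, relative):
    """Return the absolute offset for a relative offset after a match."""
    return match_start + match_length + relative


def cpu_type_after_pe(data):
    """Return the short that follows the PE signature, or None."""
    pe_start = int.from_bytes(data[0x3c:0x40], "little")     # (0x3c.l)
    if data[pe_start:pe_start + 4] != PE_SIGNATURE:
        return None
    field = relative_offset(pe_start, len(PE_SIGNATURE), 0)  # &0
    return int.from_bytes(data[field:field + 2], "little")
```
.Ed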
.Pp
Indirect and relative offsets can be combined:
.Bd -literal -offset indent
0             string  MZ
\*[Gt]0x18         leshort \*[Lt]0x40
\*[Gt]\*[Gt](4.s*512)   leshort !0x014c MZ executable (MS-DOS)
# if it's not COFF, go back 512 bytes and add the offset taken
# from byte 2/3, which is yet another way of finding the start
# of the extended executable
\*[Gt]\*[Gt]\*[Gt]\*[Am](2.s-514) string  LE      LE executable (MS Windows VxD driver)
.Ed
.Pp
Or the other way around:
.Bd -literal -offset indent
0                 string  MZ
\*[Gt]0x18             leshort \*[Gt]0x3f
\*[Gt]\*[Gt](0x3c.l)        string  LE\e0\e0  LE executable (MS-Windows)
# at offset 0x80 (-4, since relative offsets start at the end
# of the up-level match) inside the LE header, we find the absolute
# offset to the code area, where we look for a specific signature
\*[Gt]\*[Gt]\*[Gt](\*[Am]0x7c.l+0x26) string  UPX     \eb, UPX compressed
.Ed
.Pp
Or even both!
.Bd -literal -offset indent
0                string  MZ
\*[Gt]0x18            leshort \*[Gt]0x3f
\*[Gt]\*[Gt](0x3c.l)       string  LE\e0\e0 LE executable (MS-Windows)
# at offset 0x58 inside the LE header, we find the relative offset
# to a data area where we look for a specific signature
\*[Gt]\*[Gt]\*[Gt]\*[Am](\*[Am]0x54.l-3)  string  UNACE  \eb, ACE self-extracting archive
.Ed
.Pp
If you have to deal with offset/length pairs in your file, even the
second value in a parenthesized expression can be taken from the file itself,
using another set of parentheses.
Note that this additional indirect offset is always relative to the
start of the main indirect offset.
.Bd -literal -offset indent
0                 string       MZ
\*[Gt]0x18             leshort      \*[Gt]0x3f
\*[Gt]\*[Gt](0x3c.l)        string       PE\e0\e0 PE executable (MS-Windows)
# search for the PE section called ".idata"...
\*[Gt]\*[Gt]\*[Gt]\*[Am]0xf4          search/0x140 .idata
# ...and go to the end of it, calculated from start+length;
# these are located 14 and 10 bytes after the section name
\*[Gt]\*[Gt]\*[Gt]\*[Gt](\*[Am]0xe.l+(-4)) string       PK\e3\e4 \eb, ZIP self-extracting archive
.Ed
.Pp
If you have a list of known values at a particular continuation level,
and you want to provide a switch-like default case:
.Bd -literal -offset indent
# clear that continuation level match
\*[Gt]18	clear
\*[Gt]18	lelong	1	one
\*[Gt]18	lelong	2	two
\*[Gt]18	default	x
# print default match
\*[Gt]\*[Gt]18	lelong	x	unmatched 0x%x
.Ed
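.Pp
The switch-like behaviour of
.Dv clear
and
.Dv default
can be sketched in Python.
The function
.Fn switch_level
is hypothetical; it models one continuation level only and takes the value
read from the file as an argument instead of decoding it.
.Bd -literal -offset indent
```python
# Hypothetical sketch of clear and default at a single continuation level.
def switch_level(value, cases):
    """Return the messages for one level using clear/default semantics.

    cases maps an expected value to its message; the value read from the
    file is passed in directly instead of being decoded from the file.
    """
    matched = False                 # "clear" resets the match flag
    messages = []
    for expected, message in cases.items():
        if value == expected:
            matched = True
            messages.append(message)
    if not matched:
        # "default" succeeds only when nothing matched at this level.
        messages.append("unmatched 0x%x" % value)
    return messages
```
.Ed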
.Sh SEE ALSO
.Xr file 1
\- the command that reads this file.
.Sh BUGS
The formats
.Dv long ,
.Dv belong ,
.Dv lelong ,
.Dv melong ,
.Dv short ,
.Dv beshort ,
and
.Dv leshort
do not depend on the length of the C data types
.Dv short
and
.Dv long
on the platform, even though the Single
.Ux
Specification implies that they do.
However, as OS X Mountain Lion has
passed the Single
.Ux
Specification validation suite, and supplies a version of
.Xr file 1
in which they do not depend on the sizes of the C data types and that is
built for a 64-bit environment in which
.Dv long
is 8 bytes rather than 4 bytes, presumably the validation suite does not
test whether, for example,
.Dv long
refers to an item with the same size as the C data type
.Dv long .
There should probably be
.Dv type
names
.Dv int8 ,
.Dv uint8 ,
.Dv int16 ,
.Dv uint16 ,
.Dv int32 ,
.Dv uint32 ,
.Dv int64 ,
and
.Dv uint64 ,
and specified-byte-order variants of them,
to make it clearer that those types have specified widths.
.\"
.\" From: guy@sun.uucp (Guy Harris)
.\" Newsgroups: net.bugs.usg
.\" Subject: /etc/magic's format isn't well documented
.\" Message-ID: <2752@sun.uucp>
.\" Date: 3 Sep 85 08:19:07 GMT
.\" Organization: Sun Microsystems, Inc.
.\" Lines: 136
.\"
.\" Here's a manual page for the format accepted by the "file" made by adding
.\" the changes I posted to the S5R2 version.
.\"
.\" Modified for Ian Darwin's version of the file command.
732