.\"	$NetBSD: magic.5,v 1.2 2009/05/08 16:39:46 christos Exp $
.\"
.\" $File: magic.man,v 1.59 2008/11/06 23:22:53 christos Exp $
.Dd August 30, 2008
.Dt MAGIC 5
.Os
.\" install as magic.4 on USG, magic.5 on V7, Berkeley and Linux systems.
.Sh NAME
.Nm magic
.Nd file command's magic pattern file
.Sh DESCRIPTION
This manual page documents the format of the magic file as
used by the
.Xr file 1
command, version 5.03.
The
.Xr file 1
command identifies the type of a file using,
among other tests,
a test for whether the file contains certain
.Dq "magic patterns" .
The file
.Pa /usr/share/misc/magic
specifies what patterns are to be tested for, what message or
MIME type to print if a particular pattern is found,
and additional information to extract from the file.
.Pp
Each line of the file specifies a test to be performed.
A test compares the data starting at a particular offset
in the file with a byte value, a string or a numeric value.
If the test succeeds, a message is printed.
The line consists of the following fields:
.Bl -tag -width ".Dv message"
.It Dv offset
A number specifying the offset, in bytes, into the file of the data
which is to be tested.
.It Dv type
The type of the data to be tested.
The possible values are:
.Bl -tag -width ".Dv lestring16"
.It Dv byte
A one-byte value.
.It Dv short
A two-byte value in this machine's native byte order.
.It Dv long
A four-byte value in this machine's native byte order.
.It Dv quad
An eight-byte value in this machine's native byte order.
.It Dv float
A 32-bit single precision IEEE floating point number in this machine's native byte order.
.It Dv double
A 64-bit double precision IEEE floating point number in this machine's native byte order.
.It Dv string
A string of bytes.
The string type specification can be optionally followed
by /[Bbc]*.
The
.Dq B
flag compacts whitespace in the target, which must
contain at least one whitespace character.
If the magic has
.Dv n
consecutive blanks, the target needs at least
.Dv n
consecutive blanks to match.
The
.Dq b
flag treats every blank in the target as an optional blank.
Finally, the
.Dq c
flag specifies case insensitive matching: lowercase
characters in the magic match both lower and upper case characters in the
target, whereas upper case characters in the magic only match uppercase
characters in the target.
.It Dv pstring
A Pascal-style string where the first byte is interpreted as an
unsigned length.
The string is not NUL terminated.
.It Dv date
A four-byte value interpreted as a UNIX date.
.It Dv qdate
An eight-byte value interpreted as a UNIX date.
.It Dv ldate
A four-byte value interpreted as a UNIX-style date, but interpreted as
local time rather than UTC.
.It Dv qldate
An eight-byte value interpreted as a UNIX-style date, but interpreted as
local time rather than UTC.
.It Dv beid3
A 32-bit ID3 length in big-endian byte order.
.It Dv beshort
A two-byte value in big-endian byte order.
.It Dv belong
A four-byte value in big-endian byte order.
.It Dv bequad
An eight-byte value in big-endian byte order.
.It Dv befloat
A 32-bit single precision IEEE floating point number in big-endian byte order.
.It Dv bedouble
A 64-bit double precision IEEE floating point number in big-endian byte order.
.It Dv bedate
A four-byte value in big-endian byte order,
interpreted as a UNIX date.
.It Dv beqdate
An eight-byte value in big-endian byte order,
interpreted as a UNIX date.
.It Dv beldate
A four-byte value in big-endian byte order,
interpreted as a UNIX-style date, but interpreted as local time rather
than UTC.
.It Dv beqldate
An eight-byte value in big-endian byte order,
interpreted as a UNIX-style date, but interpreted as local time rather
than UTC.
.It Dv bestring16
A two-byte Unicode (UCS16) string in big-endian byte order.
.It Dv leid3
A 32-bit ID3 length in little-endian byte order.
.It Dv leshort
A two-byte value in little-endian byte order.
.It Dv lelong
A four-byte value in little-endian byte order.
.It Dv lequad
An eight-byte value in little-endian byte order.
.It Dv lefloat
A 32-bit single precision IEEE floating point number in little-endian byte order.
.It Dv ledouble
A 64-bit double precision IEEE floating point number in little-endian byte order.
.It Dv ledate
A four-byte value in little-endian byte order,
interpreted as a UNIX date.
.It Dv leqdate
An eight-byte value in little-endian byte order,
interpreted as a UNIX date.
.It Dv leldate
A four-byte value in little-endian byte order,
interpreted as a UNIX-style date, but interpreted as local time rather
than UTC.
.It Dv leqldate
An eight-byte value in little-endian byte order,
interpreted as a UNIX-style date, but interpreted as local time rather
than UTC.
.It Dv lestring16
A two-byte Unicode (UCS16) string in little-endian byte order.
.It Dv melong
A four-byte value in middle-endian (PDP-11) byte order.
.It Dv medate
A four-byte value in middle-endian (PDP-11) byte order,
interpreted as a UNIX date.
.It Dv meldate
A four-byte value in middle-endian (PDP-11) byte order,
interpreted as a UNIX-style date, but interpreted as local time rather
than UTC.
.It Dv indirect
Starting at the given offset, consult the magic database again.
.It Dv regex
A regular expression match in extended POSIX regular expression syntax
(like egrep).
Regular expressions can take exponential time to
process, and their performance is hard to predict, so their use is
discouraged.
When used in production environments, their performance
should be carefully checked.
The type specification can be optionally
followed by
.Dv /[c][s] .
The
.Dq c
flag makes the match case insensitive, while the
.Dq s
flag updates the offset to the start offset of the match, rather than the end.
The regular expression is tested against line
.Dv N + 1
onwards, where
.Dv N
is the given offset.
Line endings are assumed to be in the machine's native format.
.Dv ^
and
.Dv $
match the beginning and end of individual lines, respectively,
not beginning and end of file.
.It Dv search
A literal string search starting at the given offset.
The same
modifier flags can be used as for string patterns.
The modifier flags
(if any) must be followed by
.Dv /number
the range, that is, the number of positions at which the match will be
attempted, starting from the start offset.
This is suitable for
searching larger binary expressions with variable offsets, using
.Dv \e
escapes for special characters.
The offset works as for regex.
.It Dv default
This is intended to be used with the test
.Em x
(which is always true) and a message that is to be used if there are
no other matches.
.El
.Pp
Each top-level magic pattern (see below for an explanation of levels)
is classified as text or binary according to the types used.
Types
.Dq regex
and
.Dq search
are classified as text tests, unless non-printable characters are used
in the pattern.
All other tests are classified as binary.
A top-level
pattern is considered to be a text pattern when all its patterns are text
patterns; otherwise, it is considered to be a binary pattern.
When
matching a file, binary patterns are tried first; if no match is
found, and the file looks like text, then its encoding is determined
and the text patterns are tried.
.Pp
The numeric types may optionally be followed by
.Dv \*[Am]
and a numeric value,
to specify that the value is to be AND'ed with the
numeric value before any comparisons are done.
Prepending a
.Dv u
to the type indicates that ordered comparisons should be unsigned.
.It Dv test
The value to be compared with the value from the file.
If the type is
numeric, this value
is specified in C form; if it is a string, it is specified as a C string
with the usual escapes permitted (e.g. \en for new-line).
.Pp
Numeric values
may be preceded by a character indicating the operation to be performed.
It may be
.Dv = ,
to specify that the value from the file must equal the specified value,
.Dv \*[Lt] ,
to specify that the value from the file must be less than the specified
value,
.Dv \*[Gt] ,
to specify that the value from the file must be greater than the specified
value,
.Dv \*[Am] ,
to specify that the value from the file must have set all of the bits
that are set in the specified value,
.Dv ^ ,
to specify that the value from the file must have any of the bits
that are set in the specified value clear,
.Dv ~ ,
to specify that the value given after it is negated before being tested, or
.Dv x ,
to specify that any value will match.
If the character is omitted, it is assumed to be
.Dv = .
Operators
.Dv \*[Am] ,
.Dv ^ ,
and
.Dv ~
don't work with floats and doubles.
The operator
.Dv !\&
specifies that the line matches if the test does
.Em not
succeed.
.Pp
Numeric values are specified in C form; e.g.
.Dv 13
is decimal,
.Dv 013
is octal, and
.Dv 0x13
is hexadecimal.
.Pp
For string values, the string from the
file must match the specified string.
The operators
.Dv = ,
.Dv \*[Lt]
and
.Dv \*[Gt]
(but not
.Dv \*[Am] )
can be applied to strings.
The length used for matching is that of the string argument
in the magic file.
This means that a line can match any non-empty string (usually used to
then print the string), with
.Em \*[Gt]\e0
(because all non-empty strings are greater than the empty string).
.Pp
The special test
.Em x
always evaluates to true.
.It Dv message
The message to be printed if the comparison succeeds.
If the string contains a
.Xr printf 3
format specification, the value from the file (with any specified masking
performed) is printed using the message as the format string.
If the string begins with
.Dq \eb ,
the message printed is the remainder of the string with no whitespace
added before it: multiple matches are normally separated by a single
space.
.El
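.Pp
As a sketch of how the four fields fit together, the entries below describe
a hypothetical format (the
.Dq FOO1
signature and its layout are assumptions made for illustration, not a real
file type).
They combine a mask on the type, the
.Dv \*[Am]
test operator, a
.Xr printf 3
format and the
.Dq \eb
prefix; the leading
.Em \*[Gt]
characters mark continuation levels, which are explained further below:
.Bd -literal -offset indent
# offset  type            test       message
0         string          FOO1       Foo data
# low four bits of the byte at offset 4, printed as a number
\*[Gt]4        byte\*[Am]0x0f  x          \eb, version %d
# matches only if bit 0x80 of that byte is set
\*[Gt]4        byte            \*[Am]0x80 \eb, compressed
# any non-empty string at offset 8 matches and is printed
\*[Gt]8        string          \*[Gt]\e0  \eb, title "%s"
.Ed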
.Pp
An Apple 4+4 character creator and type can be specified as:
.Bd -literal -offset indent
!:apple	CREATYPE
.Ed
.Pp
A MIME type is given on a separate line, which must be the next
line that is neither blank nor a comment after the magic line that
identifies the file type, and has the following format:
.Bd -literal -offset indent
!:mime	MIMETYPE
.Ed
.Pp
i.e. the literal string
.Dq !:mime
followed by the MIME type.
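.Pp
For instance, an entry for PNG images might look like the following sketch.
The eight signature bytes are written as octal escapes, and the message
text is illustrative rather than copied from the distributed magic file:
.Bd -literal -offset indent
0	string	\e211PNG\e015\e012\e032\e012	PNG image data
!:mime	image/png
.Ed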
.Pp
An optional strength can be supplied on a separate line which refers to
the current magic description using the following format:
.Bd -literal -offset indent
!:strength OP VALUE
.Ed
.Pp
The operator
.Dv OP
can be:
.Dv + ,
.Dv - ,
.Dv * ,
or
.Dv /
and
.Dv VALUE
is a constant between 0 and 255.
This constant is applied using the specified operator
to the currently computed default magic strength.
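.Pp
A minimal sketch, reusing a hypothetical
.Dq FOO1
signature that is assumed here purely for illustration: the strength line
follows the magic line it modifies and adds 10 to the strength computed
for that entry.
.Bd -literal -offset indent
0	string	FOO1	Foo data
!:strength + 10
.Ed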
.Pp
Some file formats contain additional information which is to be printed
along with the file type or need additional tests to determine the true
file type.
These additional tests are introduced by one or more
.Em \*[Gt]
characters preceding the offset.
The number of
.Em \*[Gt]
on the line indicates the level of the test; a line with no
.Em \*[Gt]
at the beginning is considered to be at level 0.
Tests are arranged in a tree-like hierarchy:
if the test on a line at level
.Em n
succeeds, all following tests at level
.Em n+1
are performed, and the messages printed if the tests succeed, until a line
with level
.Em n
(or less) appears.
For more complex files, one can use empty messages to get just the
"if/then" effect, in the following way:
.Bd -literal -offset indent
0      string   MZ
\*[Gt]0x18  leshort  \*[Lt]0x40   MS-DOS executable
\*[Gt]0x18  leshort  \*[Gt]0x3f   extended PC executable (e.g., MS Windows)
.Ed
.Pp
Offsets do not need to be constant, but can also be read from the file
being examined.
If the first character following the last
.Em \*[Gt]
is a
.Em (
then the string after the parenthesis is interpreted as an indirect offset.
That means that the number after the parenthesis is used as an offset in
the file.
The value at that offset is read, and is used again as an offset
in the file.
Indirect offsets are of the form:
.Em ( x [.[bislBISLm]][+\-][ y ]) .
The value of
.Em x
is used as an offset in the file.
A byte, id3 length, short or long is read at that offset depending on the
.Em [bislBISLm]
type specifier.
The capitalized types interpret the number as a big endian
value, whereas the small letter versions interpret the number as a little
endian value;
the
.Em m
type interprets the number as a middle endian (PDP-11) value.
To that number the value of
.Em y
is added and the result is used as an offset in the file.
The default type if one is not specified is long.
.Pp
That way variable length structures can be examined:
.Bd -literal -offset indent
# MS Windows executables are also valid MS-DOS executables
0           string  MZ
\*[Gt]0x18       leshort \*[Lt]0x40   MZ executable (MS-DOS)
# skip the whole block below if it is not an extended executable
\*[Gt]0x18       leshort \*[Gt]0x3f
\*[Gt]\*[Gt](0x3c.l)  string  PE\e0\e0  PE executable (MS-Windows)
\*[Gt]\*[Gt](0x3c.l)  string  LX\e0\e0  LX executable (OS/2)
.Ed
.Pp
This strategy of examining has a drawback: you must make sure that
you eventually print something, or users may get empty output (for example, when
there is neither PE\e0\e0 nor LX\e0\e0 in the above example).
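.Pp
One way to avoid empty output, offered as a sketch and not taken from the
distributed magic file, is to give the level-0 line a message of its own
and let the deeper tests append to it with
.Dq \eb .
The message texts below are assumptions chosen for illustration:
.Bd -literal -offset indent
# the level-0 message is printed whenever the MZ signature matches
0           string  MZ      MZ executable
\*[Gt]0x18       leshort \*[Gt]0x3f
\*[Gt]\*[Gt](0x3c.l)  string  PE\e0\e0  \eb, PE (MS-Windows)
\*[Gt]\*[Gt](0x3c.l)  string  LX\e0\e0  \eb, LX (OS/2)
.Ed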
.Pp
If this indirect offset cannot be used directly, simple calculations are
possible: appending
.Em [+-*/%\*[Am]|^]number
inside parentheses allows one to modify
the value read from the file before it is used as an offset:
.Bd -literal -offset indent
# MS Windows executables are also valid MS-DOS executables
0           string  MZ
# sometimes, the value at 0x18 is less than 0x40 but there's still an
# extended executable, simply appended to the file
\*[Gt]0x18       leshort \*[Lt]0x40
\*[Gt]\*[Gt](4.s*512) leshort 0x014c  COFF executable (MS-DOS, DJGPP)
\*[Gt]\*[Gt](4.s*512) leshort !0x014c MZ executable (MS-DOS)
.Ed
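.Pp
As a worked illustration of how
.Em (4.s*512)
resolves, assume (purely as example data) that the two bytes at offset 4
hold the values shown below:
.Bd -literal -offset indent
bytes at offsets 4 and 5:  03 00    (little-endian short = 3)
(4.s*512)               =  3 * 512 = 1536 = 0x600
.Ed
.Pp
With that data, the
.Dv leshort
tests at level 2 read their two bytes starting at offset 0x600.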
.Pp
Sometimes you do not know the exact offset as this depends on the length or
position (when indirection was used before) of preceding fields.
You can specify an offset relative to the end of the last up-level
field using
.Sq \*[Am]
as a prefix to the offset:
.Bd -literal -offset indent
0           string  MZ
\*[Gt]0x18       leshort \*[Gt]0x3f
\*[Gt]\*[Gt](0x3c.l)  string  PE\e0\e0    PE executable (MS-Windows)
# immediately following the PE signature is the CPU type
\*[Gt]\*[Gt]\*[Gt]\*[Am]0       leshort 0x14c     for Intel 80386
\*[Gt]\*[Gt]\*[Gt]\*[Am]0       leshort 0x184     for DEC Alpha
.Ed
.Pp
Indirect and relative offsets can be combined:
.Bd -literal -offset indent
0             string  MZ
\*[Gt]0x18         leshort \*[Lt]0x40
\*[Gt]\*[Gt](4.s*512)   leshort !0x014c MZ executable (MS-DOS)
# if it's not COFF, go back 512 bytes and add the offset taken
# from byte 2/3, which is yet another way of finding the start
# of the extended executable
\*[Gt]\*[Gt]\*[Gt]\*[Am](2.s-514) string  LE      LE executable (MS Windows VxD driver)
.Ed
.Pp
Or the other way around:
.Bd -literal -offset indent
0                 string  MZ
\*[Gt]0x18             leshort \*[Gt]0x3f
\*[Gt]\*[Gt](0x3c.l)        string  LE\e0\e0  LE executable (MS-Windows)
# at offset 0x80 (-4, since relative offsets start at the end
# of the up-level match) inside the LE header, we find the absolute
# offset to the code area, where we look for a specific signature
\*[Gt]\*[Gt]\*[Gt](\*[Am]0x7c.l+0x26) string  UPX     \eb, UPX compressed
.Ed
.Pp
Or even both!
.Bd -literal -offset indent
0                string  MZ
\*[Gt]0x18            leshort \*[Gt]0x3f
\*[Gt]\*[Gt](0x3c.l)       string  LE\e0\e0 LE executable (MS-Windows)
# at offset 0x58 inside the LE header, we find the relative offset
# to a data area where we look for a specific signature
\*[Gt]\*[Gt]\*[Gt]\*[Am](\*[Am]0x54.l-3)  string  UNACE  \eb, ACE self-extracting archive
.Ed
.Pp
Finally, if you have to deal with offset/length pairs in your file, even the
second value in a parenthesized expression can be taken from the file itself,
using another set of parentheses.
Note that this additional indirect offset is always relative to the
start of the main indirect offset.
.Bd -literal -offset indent
0                 string       MZ
\*[Gt]0x18             leshort      \*[Gt]0x3f
\*[Gt]\*[Gt](0x3c.l)        string       PE\e0\e0 PE executable (MS-Windows)
# search for the PE section called ".idata"...
\*[Gt]\*[Gt]\*[Gt]\*[Am]0xf4          search/0x140 .idata
# ...and go to the end of it, calculated from start+length;
# these are located 14 and 10 bytes after the section name
\*[Gt]\*[Gt]\*[Gt]\*[Gt](\*[Am]0xe.l+(-4)) string       PK\e3\e4 \eb, ZIP self-extracting archive
.Ed
.Sh SEE ALSO
.Xr file 1
\- the command that reads this file.
.Sh BUGS
The formats
.Dv long ,
.Dv belong ,
.Dv lelong ,
.Dv melong ,
.Dv short ,
.Dv beshort ,
.Dv leshort ,
.Dv date ,
.Dv bedate ,
.Dv medate ,
.Dv ledate ,
.Dv beldate ,
.Dv leldate ,
and
.Dv meldate
are system-dependent; perhaps they should be specified as a number
of bytes (2B, 4B, etc),
since the files being recognized typically come from
a system on which the lengths are invariant.
.\"
.\" From: guy@sun.uucp (Guy Harris)
.\" Newsgroups: net.bugs.usg
.\" Subject: /etc/magic's format isn't well documented
.\" Message-ID: <2752@sun.uucp>
.\" Date: 3 Sep 85 08:19:07 GMT
.\" Organization: Sun Microsystems, Inc.
.\" Lines: 136
.\"
.\" Here's a manual page for the format accepted by the "file" made by adding
.\" the changes I posted to the S5R2 version.
.\"
.\" Modified for Ian Darwin's version of the file command.