#------------------------------------------------------------------------------
# $File: statistics,v 1.3 2022/03/24 15:48:58 christos Exp $
# statistics:  file(1) magic for statistics related software
#

# From Remy Rampin

# Stata is a statistical software tool that was created in 1985. While I
# don't personally use it, data files in its native (proprietary) format
# are common (.dta files).
#
# Because they are so common, especially in statistical and social
# sciences, Stata files and SPSS files can be opened by a lot of modern
# software; for example, Python's pandas package provides built-in
# support for them (read_stata() and read_spss()).
#
# I noticed that the magic database includes an entry for SPSS files but
# not Stata files. Stata files for Stata 13 and newer (formats 117, 118,
# and 119) always begin with the string "<stata_dta><header>" as per
# https://www.stata.com/help.cgi?dta#definition
#
# The format version number always follows, for example:
#    <stata_dta><header><release>117</release>
#    <stata_dta><header><release>118</release>
#
# Therefore the following line would do the trick:
#    0       string  <stata_dta><header>     Stata Data File
#
# (I'm sure the version number could be captured as well but I did not
# manage this without a regex)
#
# Unfortunately the previous formats (created by Stata versions before
# 13, which was released in 2013) are harder to recognize. Format 115
# starts with the four bytes 0x73010100 or 0x73020100, format 114 with
# 0x72010100 or 0x72020100, format 113 with 0x71010101 or 0x71020101.
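#
# A possible sketch for those older formats, built only from the byte
# values listed above. It is untested and deliberately left commented
# out: a bare four-byte signature is an assumption-heavy match and may
# produce false positives on unrelated binary files.
#    0       belong  0x73010100      Stata Data File (format 115)
#    0       belong  0x73020100      Stata Data File (format 115)
#    0       belong  0x72010100      Stata Data File (format 114)
#    0       belong  0x72020100      Stata Data File (format 114)
#    0       belong  0x71010101      Stata Data File (format 113)
#    0       belong  0x71020101      Stata Data File (format 113)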
#
# For additional reference, the Library of Congress website has an entry
# for the Stata Data File Format 118:
# https://www.loc.gov/preservation/digital/formats/fdd/fdd000471.shtml
#
# Examples of such files can be found on Zenodo:
# https://zenodo.org/search?page=1&size=20&q=&file_type=dta
0	string	\<stata_dta\>\<header\>\<release\>	Stata Data File
>&0	regex	[0-9]+					(Release %s)