
#------------------------------------------------------------------------------
# $File: statistics,v 1.3 2022/03/24 15:48:58 christos Exp $
# statistics:  file(1) magic for statistics related software
#

# From Remy Rampin

# Stata is a statistical software tool that was created in 1985. While I
# don't personally use it, data files in its native (proprietary) format
# are common (.dta files).
#
# Because they are so common, especially in statistical and social
# sciences, Stata files and SPSS files can be opened by a lot of modern
# software; for example, Python's pandas package provides built-in
# support for them (read_stata() and read_spss()).
#
# I noticed that the magic database includes an entry for SPSS files but
# not Stata files. Stata files for Stata 13 and newer (formats 117, 118,
# and 119) always begin with the string "<stata_dta><header>" as per
# https://www.stata.com/help.cgi?dta#definition
#
# The format version number always follows, for example:
#    <stata_dta><header><release>117</release>
#    <stata_dta><header><release>118</release>
#
# Therefore the following line would do the trick:
#    0       string  <stata_dta><header>     Stata Data File
#
# (I'm sure the version number could be captured as well but I did not
# manage this without a regex)
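#
# One regex-free possibility, sketched here as an assumption rather than
# a tested entry: the string "<stata_dta><header><release>" is 28 bytes
# long, so the release digits start at offset 28 and each known release
# can be matched literally. The description strings are illustrative:
#    0       string  \<stata_dta\>\<header\>\<release\>  Stata Data File
#    >28     string  117                                 (Release 117)
#    >28     string  118                                 (Release 118)
#    >28     string  119                                 (Release 119)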
#
# Unfortunately the previous formats (created by Stata versions before
# 13, which was released in 2013) are harder to recognize. Format 115
# starts with the four bytes 0x73010100 or 0x73020100, format 114 with
# 0x72010100 or 0x72020100, format 113 with 0x71010101 or 0x71020101.
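#
# If these older formats were to be detected, entries along the following
# lines might work. This is only a sketch built from the byte values
# listed above, not a tested rule; four-byte signatures are weak and may
# produce false positives. The description strings are illustrative:
#    0       belong  0x73010100      Stata Data File (format 115)
#    0       belong  0x73020100      Stata Data File (format 115)
#    0       belong  0x72010100      Stata Data File (format 114)
#    0       belong  0x72020100      Stata Data File (format 114)
#    0       belong  0x71010101      Stata Data File (format 113)
#    0       belong  0x71020101      Stata Data File (format 113)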
#
# For additional reference, the Library of Congress website has an entry
# for the Stata Data File Format 118:
# https://www.loc.gov/preservation/digital/formats/fdd/fdd000471.shtml
#
# Examples of such files can be found on Zenodo:
# https://zenodo.org/search?page=1&size=20&q=&file_type=dta
0	string	\<stata_dta\>\<header\>\<release\>	Stata Data File
>&0	regex	[0-9]+					(Release %s)
