#------------------------------------------------------------------------------
# $File: statistics,v 1.3 2022/03/24 15:48:58 christos Exp $
# statistics: file(1) magic for statistics related software
#

# From Remy Rampin

# Stata is a statistical software tool that was created in 1985. While I
# don't personally use it, data files in its native (proprietary) format
# are common (.dta files).
#
# Because they are so common, especially in statistical and social
# sciences, Stata files and SPSS files can be opened by a lot of modern
# software; for example, Python's pandas package provides built-in
# support for them (read_stata() and read_spss()).
#
# I noticed that the magic database includes an entry for SPSS files but
# not Stata files. Stata files for Stata 13 and newer (formats 117, 118,
# and 119) always begin with the string "<stata_dta><header>" as per
# https://www.stata.com/help.cgi?dta#definition
#
# The format version number always follows, for example:
#   <stata_dta><header><release>117</release>
#   <stata_dta><header><release>118</release>
#
# Therefore the following line would do the trick:
#   0 string <stata_dta><header> Stata Data File
#
# (I'm sure the version number could be captured as well but I did not
# manage this without a regex)
#
# Unfortunately the previous formats (created by Stata before 13, which
# was released in 2013) are harder to recognize. Format 115 starts with
# the four bytes 0x73010100 or 0x73020100, format 114 with 0x72010100 or
# 0x72020100, format 113 with 0x71010101 or 0x71020101.
#
# For additional reference, the Library of Congress website has an entry
# for the Stata Data File Format 118:
# https://www.loc.gov/preservation/digital/formats/fdd/fdd000471.shtml
#
# Examples of those files can be found on Zenodo:
# https://zenodo.org/search?page=1&size=20&q=&file_type=dta
0	string	\<stata_dta\>\<header\>\<release\>	Stata Data File
>&0	regex	[0-9]+		(Release %s)
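#
# A minimal, untested sketch for the older formats, built only from the
# four-byte prefixes quoted in the note above. It is an assumption, not
# part of the original submission, and is left commented out because a
# four-byte match is weak and could misidentify unrelated files. The
# description strings are hypothetical.
#0	belong	0x73010100	Stata Data File (Release 115)
#0	belong	0x73020100	Stata Data File (Release 115)
#0	belong	0x72010100	Stata Data File (Release 114)
#0	belong	0x72020100	Stata Data File (Release 114)
#0	belong	0x71010101	Stata Data File (Release 113)
#0	belong	0x71020101	Stata Data File (Release 113)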