
#------------------------------------------------------------------------------
# $File: statistics,v 1.3 2022/03/24 15:48:58 christos Exp $
# statistics: file(1) magic for statistics related software
#

# From Remy Rampin

# Stata is a statistical software tool that was created in 1985. While I
# don't personally use it, data files in its native (proprietary) format
# are common (.dta files).
#
# Because they are so common, especially in statistical and social
# sciences, Stata files and SPSS files can be opened by a lot of modern
# software; for example, Python's pandas package provides built-in
# support for them (read_stata() and read_spss()).
#
# I noticed that the magic database includes an entry for SPSS files but
# not Stata files. Stata files for Stata 13 and newer (formats 117, 118,
# and 119) always begin with the string "<stata_dta><header>" as per
# https://www.stata.com/help.cgi?dta#definition
#
# The format version number always follows, for example:
#     <stata_dta><header><release>117</release>
#     <stata_dta><header><release>118</release>
#
# Therefore the following line would do the trick:
#     0 string <stata_dta><header> Stata Data File
#
# (I'm sure the version number could be captured as well but I did not
# manage this without a regex)
#
# Unfortunately the previous formats (created by Stata before 13, which
# was released in 2013) are harder to recognize. Format 115 starts with
# the four bytes 0x73010100 or 0x73020100, format 114 with 0x72010100 or
# 0x72020100, format 113 with 0x71010101 or 0x71020101.
#
# For additional reference, the Library of Congress website has an entry
# for the Stata Data File Format 118:
# https://www.loc.gov/preservation/digital/formats/fdd/fdd000471.shtml
#
# Examples of those files can be found on Zenodo:
# https://zenodo.org/search?page=1&size=20&q=&file_type=dta
0 string \<stata_dta\>\<header\>\<release\> Stata Data File
>&0 regex [0-9]+ (Release %s)
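
# A possible extension for the older formats mentioned in the comment
# above, left commented out. This is only a sketch and not part of the
# contributed entry: it assumes the four-byte signatures quoted there are
# accurate, it has not been checked against real files, and signatures
# this short may also match unrelated data.
#0 belong 0x73010100 Stata Data File (Release 115)
#0 belong 0x73020100 Stata Data File (Release 115)
#0 belong 0x72010100 Stata Data File (Release 114)
#0 belong 0x72020100 Stata Data File (Release 114)
#0 belong 0x71010101 Stata Data File (Release 113)
#0 belong 0x71020101 Stata Data File (Release 113)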