Tips for managing large-scale datasets efficiently in Stata

Occasionally we have to manage enormous datasets with many variables and observations. Depending on the power of our machine, computations in Stata take more or less time; usually, they take more time than we would like, which is frustrating and inefficient. For this reason, I would like to share some tips that I have learned over many hours in front of Stata.

  • compress command. I would say that it is one of the most valuable commands in Stata. It reduces the size of your dataset by converting the storage type of each variable to the most efficient type (see the first sketch after this list). For instance, suppose we have a string variable stored as str200, meaning it can hold up to 200 characters, but the longest value actually stored takes only 50 characters. Then compress converts the variable's storage type from str200 to str50. Similarly, a numerical variable can be stored as int, which can hold values between -32,767 and 32,740. If that variable's values only range from 0 to 99, converting it from int to byte (values between -127 and 100) reduces memory usage. This command not only reduces the amount of memory used by the dataset in the workspace, it also saves a lot of space (in terms of MBs and even GBs) in the .dta file stored on the hard disk. Consequently, the use operation will be faster the next time we load this dataset into Stata. For more information about this command, type help compress in Stata or check this video from StataCorp.
  • Keep only the necessary variables. The speed of commands such as collapse or reshape, to name a few, depends (among other factors) on the number of variables in the dataset. The fewer variables are loaded, the faster these operations run. Moreover, some procedures create temporary datasets stored in memory, so the more free memory you have, the lower the chance of running out of it.
  • Keep only the necessary observations. Sometimes we do not need a specific subset of observations in our analysis, but we keep it around, flagged with a dummy or categorical variable. When we manage large-scale datasets, I would suggest dropping those observations to speed up our work, although it can be helpful to store them in a smaller separate dataset in case you need them later (see the second sketch after this list). Moreover, this point and the previous one also help you save smaller .dta files on your hard disk.
  • Load only the necessary variables and observations. The use command can be combined with a variable list and an if condition to load only the desired variables and observations, which is very useful to avoid starting with a massive dataset full of unnecessary columns and rows. The syntax is use [varlist] [if] using dataset.dta; for example, use id age sex if sex==1 using dataset.dta (see the third sketch after this list).
  • keepusing option. There are at least three commands where limiting the variables taken from the using dataset pays off: merging two datasets (merge), joining two datasets (joinby), and appending two datasets (append). In merge and joinby, keepusing(var1 var2 ...) keeps only the variables listed inside the parentheses instead of all the variables in the using dataset (in append, the analogous option is keep()). Keeping only a few variables from the using dataset speeds up the merging, appending, or joining operation, and the resulting dataset contains fewer variables, which makes subsequent operations faster as well (see the fourth sketch after this list).
  • ftools and gtools packages. These two packages contain faster implementations of common Stata commands (such as merge, collapse or isid): they do the same work as the originals but have been programmed more efficiently (see the fifth sketch after this list). On the creators' webpages you can find benchmarks of their commands against the originals and more details about everything included in these packages; those benchmarks show that the time savings can be remarkable. For more information: the ftools webpage and the gtools webpage.
  • Split the dataset. Depending on what you are doing, you might divide your dataset into several smaller datasets and save them as separate .dta files. You can then split your procedures into different parts, each working on the dataset it needs, or loop over the files (see the sixth sketch after this list). The combinations are endless, and the best way to proceed depends on your tasks. Keep in mind that loops can also strain memory, so it is sometimes better to skip the loop and run the code in chunks.
  • parallel package. I have to admit that I have never used it, since I have the MP edition, but I think it might help speed up computations for IC/BE/SE users. One advantage of Stata MP over the IC/BE and SE licenses is that it parallelizes procedures. In a nutshell (I am not an expert at all, so I hope I explain it adequately), Stata MP makes the most of your multi-core processor (nowadays, all computers have more than one core) and spreads computations across cores (also called "parallel computing" or "parallel processing"). In other words, Stata runs faster because it can run several operations simultaneously. The parallel package does something similar, allowing you to speed up your computations with an IC/BE/SE license (see the last sketch after this list). For more information, see the package's webpage and the article in the Stata Journal.
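
Below are a few minimal sketches illustrating these tips. All file and variable names (dataset.dta, id, age, sex, income, region, year, and so on) are made up for illustration; adapt them to your own data.

First, the compress workflow, assuming a hypothetical file dataset.dta:

    * Load the (hypothetical) dataset and check its size
    use dataset.dta, clear
    describe, short

    * Recast every variable to the smallest storage type that can hold its values
    compress

    * Check the size again and overwrite the file so the smaller version is kept
    describe, short
    save dataset.dta, replace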
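
Second, pruning variables and observations before heavy commands; keep_flag is a hypothetical dummy marking the observations needed for the main analysis:

    * Keep only the variables the analysis needs
    keep id age sex income keep_flag

    * Park the excluded observations in a smaller side file, just in case
    preserve
        keep if keep_flag == 0
        save excluded_obs.dta, replace
    restore

    * Then drop them from the working dataset
    drop if keep_flag == 0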
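
Third, loading only what you need: describe using lets you inspect a file without loading it, and use accepts a variable list and an if condition:

    * See which variables the file contains without loading it
    describe using dataset.dta

    * Load only three variables and only the observations with sex == 1
    use id age sex if sex == 1 using dataset.dta, clear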
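
Fourth, the keepusing() option in merge, with hypothetical master.dta and details.dta files sharing an id key:

    use master.dta, clear

    * Bring in only income and region from the using dataset, not all of its variables
    merge 1:1 id using details.dta, keepusing(income region)

    * When appending, the analogous keep() option limits the incoming variables
    append using extra_rows.dta, keep(id age sex income)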
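
Fifth, the gtools and ftools commands are mostly drop-in replacements, so switching usually means changing little more than the command name (both packages can be installed from SSC):

    * Install once
    ssc install ftools
    ssc install gtools

    * gcollapse works like collapse, but much faster on large datasets
    gcollapse (mean) income (count) n_obs=id, by(region year)

    * gisid replaces isid to check that id and year uniquely identify observations
    gisid id year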
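
Sixth, splitting the work into chunks, here by year, with hypothetical file names:

    * Process one year at a time, saving each piece separately
    forvalues y = 2010/2020 {
        use if year == `y' using bigdata.dta, clear
        * ... heavy processing for a single year ...
        save chunk_`y'.dta, replace
    }

    * Recombine the processed pieces at the end
    use chunk_2010.dta, clear
    forvalues y = 2011/2020 {
        append using chunk_`y'.dta
    }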
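
Finally, a very rough sketch of the parallel package. As I said, I have not used it myself, so treat this as an illustration of the idea and check help parallel for the exact syntax and options:

    * Assuming the package is already installed (see its webpage)
    * Split the work across 4 child Stata instances
    parallel setclusters 4

    * Run a data transformation on each chunk of the dataset in parallel
    parallel: gen double log_income = ln(income)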

Finally, I would like to highlight that these are just a few tips that work for me. There are surely more, and you are welcome to share with the community and me your own tricks for running Stata routines faster with large-scale datasets. In particular, I would like to hear about experiences with the parallel package. Managing datasets like these is increasingly common, and any additional tips will be more than welcome.
