Dealing with Facebook data is a lot of pain. Not only due to its size, the form of the data is also not what I am familiar with before working on this project. This probably has to do with my background. Econ majors are typically trained to analyze data in the forms of cross section, panel data, or time series. Softwares like Stata are also optimized to deal with these kinds of structure.
However, there are multiple levels embeded inside the structure of Facebook. Some common levels are pages, posts, users, reactions, and comments. These levels are related to each other (for example, you need user id and post id to uniquely identify an reaction), and often you will need multiple levels while you are processing them.
You can still store them into your already-familiar flat format, but you may end up with storing an extremely large amount of redundant information.1
Hence, there are already many mature techniques, such as SQL and its relatives, to help dealing with storage and retrieval of this kind of relational data.
SQL and Cloud
On applied micro class, I shared some thoughts and experience on the process of dealing with Facebook data. Here are the materials, which includes:
- Some really basic SQL
- Use Google BigQuery (an extremely fast SQL service) and Cloud Storage
- Talk to them through R
- Run on a local MySQL server
Ideally, all data can be stored on the cloud such as Google BigQuery. You retrieve the data and transform it into the format you want to analyze, and pays for the resources used during this process. This is far more scalable than running on a local machine.
However, if your budget is limited or you need to repeatedly process the same groups of data (such as running loops), you may still consider working them on a local computer (probably with a database such as MySQL inside).
In this case, you can take a look at this document
for the file directories
and processing (splitting, combining, inserting, etc to csv) using command line
sed for row wise and
awk for column wise transformations).
JSON format can deal with this problem, but it is still quite painful to read data directly from JSON, especially if it is not stored properly.^