Since the data collection for my project on strategic communication of politicians on Facebook has been completed, it´s about time I write an extensive lab report on how it went. I have experimented with web scraping in R and Python for a while now, but this was by far the most extensive data collection I have ever conducted. In total, I collected the Facebook posts of 1,398 political candidates during the last 4 years, covering the whole 18th election period of the German Bundestag. The total sample resulted in about 710,000 posts. Of those, ca. 390,000 were classified as photo posts and for the purpose of considering also the visual aspects of strategic framing I also collected those.
The starting point of my data collection was a list of the 2653 politicians who were running in the German federal election in 2017 (19th election period of the German parliament) as well as those parliamentarians who were members of the preceding election period but did not compete in the current election. This list was compiled using several online sources like the website of the German Bundestag, the website wenwaehlen.de (an initiative comparable to Wahl-O-Mat, where voters can check on single candidates opinions not only on parties), as well as Wikipedia. These sources supply the social media links for some of the politicians, but it turned out that many Facebook profiles were missing in automated data collection and some Facebook links seemed to be outdated. Thus, I corrected and supplemented the lists with the politicians Facebook identifiers in a manual search (*phew!*). The search resulted in 2066 Facebook profiles of election candidates. In reverse, this means that for 587 politicians no Facebook was available (22.1 percent of all candidates). Although one could assume that a Facebook is a standard instrument in modern campaigning, even some very prominent politicians do not maintain a profile. – For instance, Federal Minister of the Interior Thomas de Maizière, CDU, dropped out of the sample for this reason.
My list of election candidates contains other social media links, but I did not systematically check those, because they are not relevant for the specific research purpose of the current project. Anyway, for those who are interested in conducting a similar analysis of German parliamentarians in social media, the list can be downloaded here and it will also be available on git.
Next, I had to choose an appropriate storage for the data. Since all individual records are relatively homogenous in their attributes and the proposed data records have a relational relationship to each other (politicians Facebook profiles, their posts as well as visuals contained in theses posts) a relational SQL database was the natural choice. Moreover SQL databases can inherit not only textual but also visual data in blob objects, which was a further advantage. Thus, I installed MariaDB as well as phpMyAdmin on my server and was ready for data collection.
I chose to conduct the scraping as well as the analysis in R and not in Python. Besides a general preference for the R language which has a low threshold and is very flexible, my main reason to take hold on R is because it is by far more compatible to my colleagues in communication then python is. Moreover, this project should serve as a proof of what the R language is capable of in regard to openness: I wanted my own showcase how R can be used for the entire research process from data collection, to analysis as well as publishing. And last but not least there is already a package to access the Facebook API via R: Rfacebook. Although it doesn´t solve every problem (more about this later), the package considerably facilitated my data collection (for this project I used the latest stable version 0.6.15).
The first step in the data collection was to store the .csv file of politicians into the database. Doesn’t sound too difficult, does it? Well, it nearly freaked me out! The challenge was to get the encoding right. I don’t know if this is a problem that only Windows users will encounter. I finally I found a workaround which I will document here to remember it on future occasions:
- Save the .csv file from Excel “separated by separators” (“mit Trennzeichen getrennt”).
- Open the .csv in the simple editor provided by Windows and save it again with encoding = “UTF 8”.
- When importing this .csv file in R, I set attribute encoding to „UTF-8_bin“ in the read.csv2(). Weirdly, when I check the dataframe with the view() function in R after this procedure it seems to be all messed up. But what is more important, the import to the SQL database works correctly.
- Put dataframe into the database using the RMySQL::dbWriteTable function.
The next step and the beginning of the actual data collection was to check if the politicians Facebook profiles I collected manually were a) publicly available via the API and B) if they were conceived as “user” or as “page”. Although these infos are already listed in the table of politicians provided above, they might change over time so it´s worth considering to redo this check if the list of politicians is used in another context.
- Regarding the publicity of Facebook profiles, site admins may set the visibility of their profile to public or private. Of course, only those profiles whose owners have chosen to make their content available can be accessed via the Facebook API. Nonetheless some profiles are still accessible via manual search on the platform. Self-evidently, I respect user privacy. Nonetheless, I quarrel with this situation since some of these profiles are obviously not personal or private by content but clearly aim at a broader public. Hence, I guess that some of the politicians and/or their social media staff members are not aware of the fact that their Facebook profile is not completely public (and thus cannot be found via search engines etc.) or they do not care. In total, 671 profiles are configured as private and thus dropped out of data collection.
- Most politicians (n = 1315; 94.3 %) in the remaining sample created their personal Facebook representation as Facebook “page”. This makes sense since Facebook “pages” distinguish professional or business accounts from ordinary “user” profiles. Nonetheless the sample still contains 80 non-private “user” profiles (5.7 %). This has consequences for the profiles attributes, but not for the posts on theses profiles, so it is only marginally relevant for the project: A user profile does not contain or reveal as many attributes in data collection via API: E. g. information on affiliation, birthday, biography or category of the profile cannot be downloaded from user profiles. Since the collection of posts is not affected by this differentiation, it does not really matter but it needs to be taken into account when politicians profile information should be downloaded (which I did, seen next step).
The third step of the data collection was to access the politician´s profiles. I wanted to collect them to gather some background information on the sample as well as to crosscheck whether I got the “right” profiles. Some politicians have names which are very common and there are even duplicated names within the sample (like two “Michael Meisters” one CDU, one AfD). I plan another report on the crosschecks of the data that I did. But for now let´s get back to data collection. The accessing of the profiles was the first challenge for the Rfacebook package. Actually I didn´t find a function which exactly extracted all the info I wanted. Hence, I wrote a simple GET request which returned the specific fields I was interested in. Next challenge was again to store the newly encountered data in the database and keep the right encoding. This was ensured in allocating the Encoding to „UTF-8_bin“ for every non-English text variable. In total, I collected the profiles of 1395 campaigning politicians.
Until now, the data is neither very big nor does the collection take very long. This changed in the next steps, the collection of the posts and the collection of visuals, because my aim was to download all posts from all campaigning politicians with Facebook profiles during the hole 18th election period of the German Bundestag (4 years). I decided to separate these steps from each other and to use two tables in the database to collect the posts and the visuals. The script to collect posts on Facebook is not very notable; again I had to do several checks on the encoding before everything worked fine. Moreover I decided to collect the download errors in a separate database table to gain control over them. The script was running for several days (or weeks?), it was a bit annoying though, since I had to restart the script every two hours because the token to access the API was only valid for so long. Also I encountered that some politicians had changed or deleted their Facebook profiles in the meantime, which forced me to update the sample all along. To be able to trace when I have saved a certain post, I wrote the download time into the database.
Deletion of profiles or single posts was also a problem in the final step, the collection of the visuals that could not be resolved. For data safety and practical reasons, I decided to save the visuals in two ways: First as a blob object in the database and second as a .jpg on my local hard drive. I also decided to collect only visuals which were posted in “photo posts”, video material and visuals were left out due to practical as well as conceptual reasons. In total, 389,741 pictures were downloaded, which take up nearly 30 GB of data. Given this amount of data, I will probably have to rethink the scope of this project and reduce the sample to maybe only one year of posts. I know this project cannot be considered really big data, but for me this is quite an impressive number!
All in all I´m pretty pleased with how the data collection went. I learned a lot on R, the Facebook API, as well as SQL databases. The next task will be to describe and visualize characteristic features of the sample. Of course, I will proudly present some of the insights here soon. Before I close this post which has become incredibly long, I would like to mention and remember the five most annoying things I encountered during data collection. – Since it is good practice to document not only the triumphs but also the failures. So here is the top five of what annoyed me out during data collection:
- Bad Encoding. It took me a while, but now I found some working solutions, although they feel kind of wonky.
- Politicians changing their Facebook profiles or deleting their profiles, posts and/or visuals.
- Caching of the phpMyAdmin interface (due to the caching issues I was not able to log into my account for nearly a day – Of course I didn´t know it was a caching issue then…)
- Renewing the Facebook token over and over again… and again…
- Excels nasty habit to display and save large integers in scientific format. Of course, the Facebook identifier can be seen as a large integer (it has 15 digits or so). Well, but feeding the Facebook API with 1,2345E+14 and similar does not really work…