Can we keep this between us? A simulation of negative network externalities in contact list sharing

Clara Hanson


People often share network information with third parties by granting access to their phone contacts, email address books, and social media friends lists. Privacy researchers are concerned about this practice for two reasons: it circulates personally identifying information without an alter's knowledge or consent, and compiled contact lists can reveal a great deal of social information. Privacy research often assumes individuals have control over their own information, but contact list sharing shows how interdependent privacy truly is.

This research links micro action to macro outcomes by simulating how much network data can be accurately predicted as individuals share contact lists. It contributes a sociological perspective to data interdependency research by considering what increasingly accurate methods of prediction imply for individual agency and collective outcomes when those methods are applied to patterned social information. It tests how much network information can be recovered from a dataset of contact lists, especially at low levels of sharing, and whether individual intervention at the dyad level is sufficient to preserve an individual's privacy.

I will use two empirically observed email networks: the EU email network (1,005 nodes) and the Enron email network (36,692 nodes). I will also create synthetic network data, using descriptive statistics of email graphs, to simulate networks with 100,000 and 1,000,000 nodes. I will use HOPE node embedding and the Adamic-Adar index to predict missing nodes for each simulation, and I will compare the accuracy of these methods using precision, recall, and their harmonic mean (the F1 score). I will test the following hypotheses:

H1 Node and edge recovery: I expect the pattern that underlies contact list sharing (i.e., random versus diffusion-like) to affect the kind of network data that is most quickly and accurately recovered. When nodes share contact lists randomly, the third party will compile an index of all nodes in the network more quickly. When the probability of a node sharing its contact list is conditioned on whether its neighbors have shared, ties among shared nodes will be predicted more accurately.

H2 Effect of opt-out intervention: To test whether individual-level interventions are sufficient, I will simulate network recovery when nodes can delete their own ties to others. I expect this to lower overall rates of edge recovery, but it will not always be sufficient to prevent tie rediscovery.

H3 Effect of social preference: I will simulate networks where nodes delete ties to others at random, and networks where nodes show a social preference for tie deletion, meaning they are more likely to delete weak ties and share strong ties. I expect social preference, which may appear more privacy-protecting to individuals, to lead to more accurate edge recovery.

H4 Effect of network size: I will run these simulations on networks of different sizes to test the effect of graph size on edge recovery. I expect network size and rate of sharing to have a multiplicative effect on the accuracy of edge estimation.
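
As a rough illustration of the random-sharing condition, the sketch below shows what a third party observes when a fraction of nodes share their full contact lists, then scores unobserved pairs with the Adamic-Adar index and evaluates the predictions with precision, recall, and F1. It is not the study's implementation. The toy random graph, the 20% sharing rate, the top-k cutoff, and every function and variable name are assumptions made for illustration, and HOPE embedding is omitted.

```python
import math
import random
from itertools import combinations


def make_graph(n, p, rng):
    """Hypothetical toy network: undirected random graph stored as adjacency sets."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def observe(adj, sharers):
    """Nodes and edges the third party sees when each sharer uploads a full contact list."""
    seen_nodes, seen_edges = set(), set()
    for u in sharers:
        seen_nodes.add(u)
        for v in adj[u]:
            seen_nodes.add(v)
            seen_edges.add(frozenset((u, v)))
    return seen_nodes, seen_edges


def adamic_adar(obs_adj, u, v):
    """Adamic-Adar index: sum of 1 / log(degree) over common neighbors of u and v."""
    return sum(
        1.0 / math.log(len(obs_adj[w]))
        for w in obs_adj[u] & obs_adj[v]
        if len(obs_adj[w]) > 1
    )


def precision_recall_f1(predicted, actual):
    """Precision, recall, and their harmonic mean (F1) for two sets of edges."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


rng = random.Random(0)                    # fixed seed so the sketch is repeatable
adj = make_graph(60, 0.15, rng)           # assumed size and density
true_edges = {frozenset((u, v)) for u in adj for v in adj[u]}

sharers = rng.sample(sorted(adj), 12)     # assumed: 20% of nodes share at random
seen_nodes, seen_edges = observe(adj, sharers)

# Rebuild the graph as the third party sees it.
obs_adj = {v: set() for v in seen_nodes}
for edge in seen_edges:
    u, v = tuple(edge)
    obs_adj[u].add(v)
    obs_adj[v].add(u)

# Every tie of a sharer is already observed, so hidden ties can only
# connect two non-sharers; only those pairs are scored.
non_sharers = sorted(seen_nodes - set(sharers))
scores = {
    frozenset((u, v)): adamic_adar(obs_adj, u, v)
    for u, v in combinations(non_sharers, 2)
}

TOP_K = 50                                # assumed cutoff for predicted ties
ranked = sorted((e for e in scores if scores[e] > 0), key=scores.get, reverse=True)
predicted = set(ranked[:TOP_K])
hidden = true_edges - seen_edges          # ties the third party never saw directly

precision, recall, f1 = precision_recall_f1(predicted, hidden)
print(f"nodes indexed: {len(seen_nodes)}/{len(adj)}")
print(f"edges observed: {len(seen_edges)}/{len(true_edges)}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Restricting candidates to pairs of non-sharers follows from the setup: a sharer's contact list is complete, so any unobserved pair involving a sharer is known to be a non-tie. A diffusion-like condition could be approximated by replacing the random sample with a process in which a node's chance of sharing rises with the number of its neighbors who have already shared.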
