Self-organization in a decentralized data-sharing system
Document-ranking metrics based on the topology of social-network users in data-sharing systems reduce the effects of malicious agents and provide a rational incentive for participation.
Typical Web 2.0 data-sharing platforms (such as Wikipedia, Facebook, Reddit and Yelp) are governed by a single authority, i.e., a single service provider who has the ultimate rights regarding what happens to data that is contributed by its users. In these systems, misalignments can develop between the goals of the users and those of the authority, e.g., the freedom of expression of a user vs. legal constraints, or a real or perceived censorship or ideological agenda pushed by the authority., As the authority is usually in charge of ranking documents and recommending them, there is a potential for user mistrust regarding the results that are returned. The knowledge of a centralized reputation system opens the door for a malicious agent to game the rankings. Moreover, if the single authority goes away, all contributions may be lost. Self-organization techniques represent a method by which these issues can be avoided.
We have investigated ways in which useful data-sharing applications can be provided within a decentralized-authority setting. In such a system, users store the documents that they would like to share locally, and can autonomously rank the search results for the document that they are looking for using various metrics that can be derived from the topology of the social network they belong to (more on these metrics and their relevance later). Without centralized ownership of data, censorship and bias are harder to enforce. Additionally, without centralized algorithms for ranking search results, it is harder for attackers to game the system. For this idea to work, however, there must be an incentive for users to participate in the system and to self-regulate. The question is: would such a system also be beneficial to each individual user? The advantages of this system must overcome the loss of conveniences provided by a central authority (e.g., ease of introducing new features, no ‘cold start’ problem and identity management).
Distributed systems deal with the problem of centralization as a single point of failure. Data grids and cloud computing make the notion of data location, storage and distribution transparent to the user and provide redundancies that lead to a more robust system. However, the authority remains centralized in these settings. Distributed hash tables enable the distribution of data among participating peers and provide efficient ways to locate and retrieve it. However, decisions regarding the location of the data lie in the hands of the algorithm, not the user. Indeed, the user may feel uncomfortable hosting a document that is published by an unknown peer and pushed by an algorithm, even if this document is encrypted and unreadable by the host. Peer-to-peer (P2P) file-sharing systems (e.g., Limewire and Tribler) are closer to what we are looking for. In these systems, only the files that a user shares are kept on their machine. The documents are typically ranked based on their popularity (i.e., the number of copies that are disseminated in the system). Although this is an objective metric that can be calculated locally, it is not always the best one. For example, it does not provide any way to make meaningful distinctions based on which specific users or kinds of users provide the documents that could help infer their taste, point of view or malicious intent. These applications are also limited in the functionality that they offer. Notably, there is no way to navigate from one document to the other using hyperlinks (because the location of a document depends on the availability of the machine hosting it, and there is no location-independent addressing scheme in these systems).
We have proposed a solution to these issues. The decentralized social data-sharing system (DS2) combines a social network with a P2P file-sharing system. It is therefore able to take advantage of P2P decentralization in addition to social-network metrics for ranking documents: see Figure 1. In a DS2, users can publish documents locally and are also able to search for documents by asking their peers. Queries are propagated via `gossip’. This network topology is not arbitrary, and each user within it is in charge of adding or removing their immediate neighbours, just as they would be in a traditional social network. This self-organization provides meaningful metrics for users to filter and rank documents (e.g., metrics based on the proximity of the peers who share those documents in the network). In turn, this provides an incentive for users to be careful in their choice of neighbours. Another important aspect of a DS2 is the generation of a unique identifier for each document (based on the hash of its content). Documents can therefore be referenced (named) independently of their location and, consequently, any available copy with the given identifier can be retrieved via gossip search. However, in terms of latency and the number of messages that need to be generated and exchanged, this aspect represents a bottleneck of the system.
To determine the usefulness of DS2 systems, we built a decentralized wiki system called P2Pedia. WWhen a user edits a page in P2Pedia, the system creates a new document with the added content and stores it locally. The new document retains a reference to the parent from which it originated. This non-destructive approach allows multiple versions of a document, and therefore multiple points of view, to co-exist in the system. We evaluated the use of this system in an undergraduate course setting. Users were asked to improve a seed document and were allowed to base their changes on existing documents in the system. Because there were many possible ways in which the document could be improved, students could improve on each other’s versions iteratively while retaining their own independent version. This method is not dissimilar to the organic mechanisms of scientific research or open-source software projects, where ideas can ‘fork’ and grow independently of one another.
Given the proliferation of documents, an important consideration is the way in which a user can decide whether a document is a good fit to use or build upon. Many trust indicators can be derived from the topology of the DS2. First, as in a P2P file-sharing network, the popularity of the document can be calculated by the number of copies that can be retrieved. A peer-popularity metric, based on the number of peers that decide to adopt this peer as a neighbour, can also be calculated. Documents can therefore be ranked based on the popularity of the peers who returned them. These popularity rankings could potentially be `attacked’ (gamed) via Sybil attacks, in which identities are forged to subvert a reputation system. Malicious or over-eager peers could decide to create many clones, have them host a copy of the same document (to boost the document popularity metric), and also adopt each other as neighbours to artificially boost their peer popularity. However, other trust indicators exist based on the similarities of the querying peer and those queried (in terms of the documents they both share, or the peers that they both follow). For an attack to be effective, it would have to operate on all of these metrics at the same time (not knowing which one a user would choose) and would therefore be costly (in terms of time, money, memory or bandwidth usage and possibly other resources).
To make these trust indicators effective, users need to contribute. For document similarity to work, the users need to share/host documents that they like and recommend (in a DS2 recommendation system setting, this is equivalent to users genuinely recommending the documents). For peer similarity to work, users need to choose neighbours that they find useful. Those who do not participate in these meaningful ways (‘free-riders’) will find themselves isolated, and ranking metrics won’t work well for them. In the absence of a full-scale DS2 that we could observe and evaluate empirically, one objective way to validate this incentivized-participation scheme is to frame it as an iterated n-player social-network Prisoner’s Dilemma game. In this setup, defectors (the free-riders) are punished by being dropped from the network neighbourhood. Our simulation of such a setting was able to confirm that active participants (the ‘cooperators’) were rewarded more than defectors, and that the social graph tended to evolve into one containing a large connected component made of the cooperative peers. We also used data from the Yelp application to show that aggregating recommendations from carefully selected peers, or from the connected component that results, outperforms a baseline aggregate ranking. It even improved on Yelp’s proprietary filter, which is controversial and has been accused of bias.
We have proposed a decentralized social data-sharing system (DS2) that inherits the properties of both P2P and social networks. The system maximizes the autonomy of participants and does away with a central authority. The paradigm underlying this system enables the development of useful applications (e.g., a decentralized wiki) in which there is an incentive for users to contribute in a positive way. The ideas from DS2 could also be applied to many single-authority systems. Adding such a social network to services like Yelp, Amazon or Stack Overflow would allow users to ‘follow’ contributors that they find interesting, thereby providing new and useful ways to rank contributions and aggregate recommendations that would be more resilient to trust attacks. Before DS2s are adopted widely (and to encourage their adoption), further avenues for validating their effectiveness must be explored. We are currently working on a DS2 simulator, in which various behaviours can be modelled and pitted against each other, and the resulting payoffs for all peers can be evaluated. We are also exploring document retrieval and referencing schemes that could improve the latency and reduce the number of messages generated without compromising the autonomy of the peers or making the gossip protocol overly complex.
- Reddit users flee to Swiss copy Voat after harassment clampdown, accessed 15 March 2016.
- Reddit downgrades technology community after censorship, accessed 15 March 2016.
- Does Yelp filter positive reviews if a business refuses to pay for advertising?, accessed 15 March 2016.
- Study: eBay sellers gaming the reputation system?, accessed 15 March 2016.
- S. Macbeth and J. V. Pitt, Self-organising management of user-generated data and knowledge, Knowl. Eng. Rev 30, pp. 237–264, 2015.
- S. Ratnasamy, P. Francis, M. Handley, R. M. Karp and S. Shenker, A scalable content-addressable network, pp. 161–172, 2001.
- I. Stoica, R. Morris, D. R. Karger, M. F. Kaashoek and H. Balakrishnan, Chord: a scalable peer-to-peer lookup service for internet applications, pp. 149–160, 2001.
- A. Craig, A. Davoust and Babak Esfandiari, A distributed wiki system based on peer-to-peer file sharing principles, pp. 364–371, 2011.
- A. Davoust, A. Craig, B. Esfandiari and V. Kazmierski, P2Pedia: a peer-to-peer wiki for decentralized collaboration, Concur. Comput. 27, pp. 2778–2795, 2015.
- J. R. Douceur, The Sybil attack, pp. 251–260, 2002.
- A. Davoust and B. Esfandiari, User participation and honesty in online rating systems: what a social network can do, 2016.