An article by Delian Delchev, originally published in Bulgarian on his blog.
In a previous post, I described the idea of creating a distributed architecture for the WikiLeaks site without the need of having hosting providers and hardware servers.
The idea is simple: each participant downloads and runs a small program that acts as a web server and serves the site's files and information. Anyone who wishes to help WikiLeaks can install this small program, which needs few resources, on their computer. Resources are not a problem because sites of the WikiLeaks type do not take much space. Internet speed is not a problem either: together the many users provide a huge capacity, while an individual session (for one segment of the site) does not need high speed to fetch small web files. Large data files will be transported through a peer-to-peer distribution technology such as BitTorrent, which combines the speeds of the many participants.
(The WikiLeaks site consists of 2 types of files: web pages, which are very small, 2-4 KB, and the document files themselves, which are packed as archives and will be distributed as torrents.)
Creating a distributed architecture for WikiLeaks is a technical challenge.
The system should be:
- as simple as possible - should not require any special knowledge to use it or to install it;
- fully automated in order to minimize intentional participation by the users in any operation;
- open source to eliminate any malicious rumors about potential abuse on the part of the software creators for purposes other than those proclaimed and supported by the users themselves;
- it should allow any PC to become a web server, even home computers or computers included in home or public networks;
- it should allow easy steering of users to the most accessible and closest server;
- able to successfully scale up to nearly 1 million hosts;
- as protected and secure as possible;
- some operations performed by the system (such as information uploads) must meet the maximum anonymity requirements so that they cannot easily be traced to their originators;
- the information published on the web server must be verified and reliable so that abuse would be made impossible (someone uploading information that should not be there);
- it should use a public infrastructure in order to make it harder to stop the service and make the spread of information easier;
The combination of requirements for public infrastructure, open source, and anonymity is particularly problematic. There are not many stable and decentralized public infrastructures available out there.
WikiLeaks, for its part, makes good use of the BitTorrent infrastructure. All the files are located there. BitTorrent (when using DHT and PEX) can be decentralized.
But it is poorly designed:
- it is not very fault-tolerant: when a node drops, it may take a long time for the DHT tree to recover, or the network may split into unconnected DHT networks for a while;
- the BitTorrent (and, more generally, Kademlia) DHT is a "structured decentralization". It creates a tree, but its root node (router, bootstrap node) must be known in advance (static); otherwise the tree cannot be built, a brand new node cannot connect, and the network cannot recover easily when nodes drop. For example, if router.bittorrent.com disappears, the whole DHT network may disappear or split into segments. I do accept that this risk is minimal and can be resolved manually;
- it does not allow (by default) searching by name or by part of a name;
- it does not allow (by default) searching inside files or adding information to them;
- it does not allow transmission of any messages in the network;
- it is a totally non-anonymous structure. It is easy to detect who is who and to identify the originator of a file (through smart scanning and sniffing). This is a big problem in terms of concealing the sources;
- the ability to find a file by its hash through a magnet link is a big plus: there is no need for torrent files or trackers (with the proper extensions of the torrent protocol). The flaw is the lack of a decentralized mechanism for exchanging magnet links and the searchable information that accompanies them. Such a mechanism would completely replace the need to maintain a web server;
- although peer-to-peer communication between two people exchanging files can be encrypted, the DHT system itself is neither protected nor encrypted. In addition, the structure is vulnerable to man-in-the-middle attacks and is extremely susceptible to spoofing: anyone who inserts hashes similar to those of the files they want to stop can block the transfer of each file. A single computer can destroy the very structure of the DHT (by announcing itself as close to the root node as possible and responding to all searches with fake nodes, seemingly attached directly to it). In terms of security, maidsafe-dht is better developed (I have some remarks: it can still be destroyed by an attack on the root node, and it lacks a white-noise cover), but it is not supported by BitTorrent clients, so the idea of a public infrastructure is lost, with all its advantages of better visibility and no authoritative control.
Despite all its shortcomings, I decided to bet on BitTorrent with a DHT infrastructure. The reason is very simple: anyone who has a BitTorrent client, whether or not they run a web server with the special software installed, can help secure the infrastructure of the site, and that means easy replication to millions. Technically, even if someone wants to stop BitTorrent, it will be "post factum", after the information has spread and become the property of thousands. In addition, the BitTorrent DHT is easy to restore after a deliberate crash (albeit with some manual effort): each time a root node fails, a new one can be created, a new bootstrap on the DHT can be done, or a new tracker can be created if needed, and everything will start afresh. The technology is only as strong as its supporters. More supporters mean it would be impossible to stop it. And whether there will be such supporters is not a matter of law but a question of morals. If people believe something is fair and right, it will happen.
So the following tasks/problems should be solved:
- first and foremost - how to exchange messages through DHT. I need messages, because the bittorrent protocol does not support a mechanism to automatically update files from a particular torrent, if it works in DHT (although there is such a technique available, albeit private for some clients if a tracker is present). Therefore, I must find a mechanism to notify the clients, running a web server, that there is a new version of the website package, or of one of the system files. Also, when there is an upload of a new file from an "anonymous source" it would be good to be able to notify others so that they can receive it.
- how to preserve the anonymity of the source of the message. This is a fundamental problem with the torrent DHT. Even when using anonymizing proxies like Tor, there is a way to force the node to reveal its real IP or to detect it. Solving this would need a serious modification of the libraries and the protocol, but I wanted to use the popular libtorrent-rasterbar library without any modification (for simpler code and easier upgrades), in order to have a better legacy.
- how to ensure versioning - a mechanism to let me know whether there is a newer version of the message, torrent, component, bypassing the shortcomings of the standard DHT network, which disallows searches by name or by part of a name and requires an accurate HASH string.
How to name the components?
In DHT there is only one element serving as a vector to the information - this is the hash of the file.
Here I use a trick: I create the torrent hash myself so that it encodes the purpose of the file (and so serves as its system name). It consists of 160 bits (20 bytes): the "WIKILEAKS" prefix; a single byte for the protocol version, "00"; two bytes for the type of the code/file (0001-000F are certificates, 1000 is the main website, 1111 is messages, 1010 is unauthorized additional files to upload, 8080 is authorized, trusted files); a 4-byte ID that distinguishes different files of the same type (types with only one file always use 00000001); and, in the last 4 bytes, the version of the file.
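The layout described above can be sketched in a few lines of Python. This is only an illustration of the 20-byte structure: the big-endian byte order, the ASCII encoding of the prefix and the function names `make_system_hash` and `parse_system_hash` are assumptions, and the sketch does not show how a torrent is made to carry such a hash.

```python
import struct

PREFIX = b"WIKILEAKS"        # 9 bytes, ASCII (assumed encoding)
PROTOCOL_VERSION = 0x00      # 1 byte

# Type codes quoted in the text (certificates occupy 0x0001-0x000F).
TYPE_WEBSITE = 0x1000
TYPE_MESSAGE = 0x1111
TYPE_UNAUTHORIZED = 0x1010
TYPE_AUTHORIZED = 0x8080


def make_system_hash(file_type, file_id=1, version=1):
    """Pack prefix, protocol version, type, file ID and file version into 20 bytes."""
    # ">BHII": big-endian, 1-byte protocol, 2-byte type, 4-byte ID, 4-byte version.
    name = PREFIX + struct.pack(">BHII", PROTOCOL_VERSION, file_type, file_id, version)
    assert len(name) == 20   # 9 + 1 + 2 + 4 + 4 bytes = 160 bits
    return name


def parse_system_hash(name):
    """Return (file_type, file_id, version), or None if the hash is not one of ours."""
    if len(name) != 20 or not name.startswith(PREFIX):
        return None
    _protocol, file_type, file_id, version = struct.unpack(">BHII", name[len(PREFIX):])
    return file_type, file_id, version
```

For example, `make_system_hash(TYPE_WEBSITE, 1, 7).hex()` yields the bytes of "WIKILEAKS" followed by `00`, `1000`, `00000001` and `00000007`.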
The most important file, the one with the website, is looked up in the DHT for downloading using the hash pre-encoded in the client software. Once the file is downloaded, its authenticity is checked against an RSA key pair whose public key is stored on the client in advance. Thus, only the holder of the private key can publish. After a successful download and verification, the client starts attempting to download the next version (the version field plus one). If the RSA check fails, the download is ignored. There is no way to plant a fake file or to alter the versions without owning the private key that authorizes publication.
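The "verify, then try version plus one" loop can be sketched as follows. The DHT download and the RSA check are passed in as the hypothetical callables `fetch` and `verify`, because neither API is specified in the text. Stopping at the first failed check is also an assumption about what "ignored" means in practice.

```python
def latest_verified(fetch, verify, start_version=1):
    """Walk versions upward and return the newest (version, payload) that verifies.

    fetch(version)  -> (payload, signature), or None if nothing is found (hypothetical)
    verify(payload, signature) -> True if the signature check passes (hypothetical)
    """
    best_version, best_payload = None, None
    version = start_version
    while True:
        result = fetch(version)
        if result is None:
            break                    # no newer version has been published yet
        payload, signature = result
        if not verify(payload, signature):
            break                    # failed check: ignore the download, keep what we have
        best_version, best_payload = version, payload
        version += 1                 # versioning part plus one
    return best_version, best_payload
```

A real client would repeat this walk periodically, starting from the last version it accepted.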
In practice, the method described for the web content is used to download every file except the unauthorized files with the type prefix 1010.
However, someone could publish in the DHT a file whose hash matches one of the hashes that I use. This may happen in two ways: by accident or deliberately. An accident is an incredibly rare occurrence because of the way a torrent hash is generated. Even if it happens, it is not really a problem: the majority of users will receive the right peers, and even if some of them get the wrong ones, the subsequent verification will reject the wrong files and keep the good ones.
It is also possible for someone to deliberately send wrong peers with the goal to put noise in the good ones, thus preventing the dissemination of the information. Then it is necessary to find them (the mechanism for doing this is the verification of the downloaded file against the node id of the DHT node, which announced the problematic peers) and to block them.
DHT clients should not attach to blocked nodes or obtain information from them. But this is one of the major shortcomings of today's torrent DHT: there is no mechanism for isolation. So I made my own: one of the special files (authorized by an RSA key) carries the information about the bad peers. It is sufficient for (at least) one client in the network to know the private key for publishing this blacklist and to apply the following algorithm: if the verification of any file fails, the harmful node that seeded the incorrect peers is identified and entered into the list (the process can be automated). Once the list is updated, the remaining nodes simply isolate that node from the network (the isolation may be clever, implemented with a smart node that misleads the bad nodes into believing they are registered with it, while at the same time searches never reach them).
To try to keep the announcements mostly on nodes that belong to a tree supporting my algorithm, I give the nodes created and added to the network by my software IDs built with the same scheme as the file hashes: WIKILEAKS000000000000 + a unique identifier (generated randomly). This way, the DHT XOR metric for determining distance will always favor my nodes, in announcements as well as in searches.
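Why a shared prefix helps can be seen from the Kademlia XOR metric itself. The sketch below is illustrative only: it uses just the 9-byte "WIKILEAKS" prefix followed by random bytes (the exact length of the zero padding in the scheme above is not assumed), and the helper names are hypothetical.

```python
import os

PREFIX = b"WIKILEAKS"


def make_node_id(prefix=PREFIX):
    """Build a 20-byte node ID: the shared prefix followed by random bytes."""
    return prefix + os.urandom(20 - len(prefix))


def xor_distance(a, b):
    """Kademlia distance: XOR of the two IDs, read as an unsigned integer."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


# A target hash and a node ID that share the prefix differ only in the low bits,
# so their distance is below 2**88 (160 bits minus the 72 shared prefix bits).
target = PREFIX + bytes(11)
friendly = make_node_id()
# An ID whose first bit differs from the prefix is at distance 2**159 or more.
outsider = bytes([PREFIX[0] ^ 0x80]) + os.urandom(19)
closer = min((friendly, outsider), key=lambda node: xor_distance(node, target))
```

Here `closer` is always `friendly`, because matching leading bytes cancel out in the XOR. Lookups for a prefixed hash therefore converge on prefixed nodes first.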
Each different type of file is encrypted with a different RSA key. Any public RSA key is a file and can be downloaded from the DHT network. Each successive version is encrypting the private key of the previous. Starting clients will have 3 or more pre-loaded versions of the public key. So with a good security technique (avoiding holding the private keys at the same place by the same people) the risk of compromising the private key is reduced, since they can be replaced on the move for all clients without reinstalling the software. This will greatly hinder attempts to publish unauthorized information in the authorized list.
Thus, holders of private keys can post mainstream authorized files. However, they are limited with a fixed number, and their hash prefixes are known by the clients in advance. The only information that is dynamically changeable is possibly the magnet link to the files publishing information (from WikiLeaks). But these links will be on some web page in the web part.
Any other authorized publication (web, RSA keys, server software updates, etc.) uses a different pair of RSA keys. The upgrade of each RSA key pair uses a new, different pair of RSA keys. Thus, although the keys are many, the risk of a key pair being compromised (by twisting or theft) is significantly reduced. We keep the ability to react swiftly and to replace the keys, and even the software, before serious damage occurs. This is valid for any compromised RSA group (unless they are all compromised at once, but this problem should be controlled physically).
I also want unauthorized information to be publishable through my client and to be protected by the infrastructure (although not visible on the web).
Here I have three problems:
- how to notify the other clients that there is something (new) that is important to download and cache locally, although they will not publish it on the web;
- how to preserve the anonymity of the source, if I can;
- how to keep man-in-the-middle (MITM) pattern-scanning devices from identifying what is being transported by examining the patterns;
To counter pattern discovery, I use peer to peer torrent encryption (obfuscation protocol). It is also recommended to keep the information in a compressed archive with more crypto on top of it (so that the names of the files in the archive remain invisible). In the future, I will implement in the client an integrated archive/crypto engine adding noise (adding random and redundant prefix-suffix to the file).
To notify the clients, I use the ability, built into the peer-to-peer communication, to extend the standard BitTorrent protocol (the BitTorrent extension protocol, BEP 10, which libtorrent supports), and I have created my own protocol on top of it.
A notification with prefix WikiLeaks-version-command (msg push) - hash of the file to be downloaded - is sent. The receiving party confirms or denies the protocol, but says nothing about the action they will take. If they refuse – they are a standard client. If not, this is my torrent client.
Here, however, there is a problem with anonymity. It is enough for someone to have one client (supporting the protocol), monitoring who is the first one announcing the new file - by following the IP address they will be able to locate it with all of the consequences (possible search warrants, court order, etc.).
In order to reduce the risk, I use the technique of creating random noise - the new file is not announced to all peers, but only to one of them, randomly selected, and only after a random interval of time (with a delay of up to 4 days, 2 days on average).
That peer, in turn, announces it to someone else following the same algorithm. This dramatically reduces the possibility of tracing the original data source, because nobody knows in what order the announcements are made or after how long. Also, one cannot assess who has the file and who does not, since the announcement carries no loop-prevention information; tracing back is not possible without observing at least one third of all the peers on all of their access networks. At some volume (over 2000 peers) and with geographical distribution, this becomes extremely difficult.
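One step of this noisy relay can be sketched as below. The uniform distribution is an assumption: it is simply the easiest choice that gives a maximum of 4 days and an average of 2. The function name `plan_announcement` is hypothetical, and actually sending the message is left out.

```python
import random

MAX_DELAY_SECONDS = 4 * 24 * 3600    # "a delay of up to 4 days"


def plan_announcement(peers, rng=random):
    """Pick one random peer and a random delay for a single announcement."""
    peer = rng.choice(peers)                     # announce to one peer only
    delay = rng.uniform(0, MAX_DELAY_SECONDS)    # uniform: 2 days on average
    return peer, delay
```

Every client that receives the announcement would call the same function again to choose its own next hop, which is what hides the order and timing of the chain.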
This technique allows also for better mobility; announcements could be made from the web/internet coffee shops and other public places. The hash of the uploaded unauthorized file will be random (different from the other algorithms) and therefore, technically speaking, tracking the entire range of 160 bits will give extremely low probability to differentiate a new file from the normally published files in the DHT (i.e. the noise).
The disadvantage of this technique is the lack of a loop-prevention mechanism: the announcement can be sent to a peer who already has the file. This is not a problem because of the very large timeouts (which reduce the possibility of a flood as long as the number of clients included in the infrastructure through my software is below 400000, which is a pretty good number).
A remaining deficiency is that the time to redistribute a new file among all peers can reach a maximum of 4 days × (1 + 1/2 + 1/3 + 1/4 + ... + 1/n), where n is the number of peers. But this is no big drama: even with 1000000 peers, the maximum time would be somewhere around 14.4 × 4 days = 57.6 days; it will, however, cover 80% of the peers in 20-25 days, and files will be exchanged in the meantime.
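The arithmetic behind this estimate is easy to reproduce. The snippet only evaluates the harmonic sum from the formula above; it does not model the relay itself, and the function names are illustrative.

```python
import math


def harmonic(n):
    """Return 1 + 1/2 + 1/3 + ... + 1/n."""
    return math.fsum(1.0 / k for k in range(1, n + 1))


def worst_case_days(n_peers, max_delay_days=4.0):
    """Upper bound from the text: the maximum delay times the harmonic sum."""
    return max_delay_days * harmonic(n_peers)
```

For n = 1000000 the sum is roughly 14.39, which gives the figure of about 57.6 days. Because the sum grows only logarithmically, ten times more peers add just over 9 days to the bound.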
How do I find the peers I can communicate with? These are all those who announce they already have loaded the hash of the web server files. (I'm searching for a previously known hash, and then I get its peers).
Fast extraction of the peers is a problem for their anonymity. On the other hand, publishing information is not a crime. The crime is stealing, not publishing (which is protected by the First Amendment to the U.S. Constitution, and by full freedom of expression and the laws giving priority to the public interest in Europe). This is something even DANS (the Bulgarian secret service, the State Agency for National Security) ran into in our country (the case of the site Opasnite, "the Dangerous"), so this is a stable legal rule in the Western world.
Threats of being an accomplice cannot be a problem either when using such client and publishing information. As stated above, one can be accused of complicity in stealing the data, not publishing the data, and there must be premeditation, which totally exonerates all users of my torrent client, since they do not initiate any particular publication. The common and indirect responsibility does not exist in such cases because otherwise highway builders could be sued for having traffic incidents on the roads they created.
My software attempts to bind two local ports for serving the web: 80 and 18880. The second port is a spare, as a safeguard, because not everyone will be able to open local port 80 (it may be restricted by security policy or taken by other software). Then, using UPnP, the client tries to open these ports automatically on the local firewall (if you are behind a home router). This means that in 80% of cases, without having to do anything, you will have a WikiLeaks server reachable from the Internet. You don't have to know how to configure a firewall for that.
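The port fallback can be sketched with the standard `socket` module. This covers only the local bind; the UPnP step is omitted because it needs a third-party library. The function name `bind_first_available` is an assumption.

```python
import socket

WEB_PORTS = (80, 18880)    # preferred port first, spare port second


def bind_first_available(ports=WEB_PORTS, host=""):
    """Return (listening_socket, port) for the first port that binds, else (None, None)."""
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Port already in use, or binding needs privileges (typical for port 80).
            sock.close()
            continue
        sock.listen(5)
        return sock, sock.getsockname()[1]
    return None, None
```

On most systems an unprivileged process cannot bind port 80, so in practice the spare port 18880 is the one that usually succeeds.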
How would Internet users use this infrastructure?
First, those who have the client installed on their computer may access WikiLeaks locally - on port 80 or 18880. This is the most secure way - download the client and wait for synchronization, then use it locally. This way you are co-building the WikiLeaks infrastructure.
But we cannot expect everyone to install a local client in order to access WikiLeaks. We need to provide access to those who have only a regular web client.
To do so, we should be able to direct the browsers to the IP address of a client currently running a web server. We need a DNS for some domain (WikiLeaks.ch?) and a configuration directing to the IP addresses.
A small program (a script for a DNS server such as PowerDNS) can extract all seeders (peers) for the website from its hash (similar to the technique that I use for announcing messages). It can then slowly check which of them have port 80 open and serve the correct web content. Peers that pass the test enter a list from which a DNS query returns the IP of a web server. It may return the geographically nearest IP (the distance between autonomous systems is public information) or, even simpler, subtract the address of the querying machine from each server address and return the one with the lowest absolute difference.
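The simpler "subtract the addresses" rule can be sketched with the standard `ipaddress` module. It is limited to IPv4 and leaves out the DNS server integration; `closest_server` is a hypothetical name.

```python
import ipaddress


def closest_server(client_ip, server_ips):
    """Return the server whose numeric IPv4 address is nearest to the client's."""
    client = int(ipaddress.IPv4Address(client_ip))
    return min(
        server_ips,
        key=lambda ip: abs(int(ipaddress.IPv4Address(ip)) - client),
        default=None,    # no tested servers available
    )
```

For a client at 10.0.0.5 and the servers 10.0.0.200, 192.168.1.1 and 9.255.255.250, the function picks 9.255.255.250, which is only 11 addresses away. Numeric closeness is a crude stand-in for network distance, which is why the autonomous-system approach is mentioned as the alternative.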
Thus, clients looking for WikiLeaks.ch will be directed to the closest working IP address even if this is a home PC. If this IP dies, another one will be announced automatically (within 15 min) upon request.
Besides speed and fault tolerance, this scheme disallows easy monitoring of the web servers (and their demolition), unless one sends DNS queries from all networks where such servers are available (and this is impossible without prior information).
This structure would achieve above-average resilience as an infrastructure (given the current technologies and attacks), and would allow a truly distributed and decentralized cloud (a standard API on the corresponding web server is sufficient to meet the definition).
The domain of the organization, however, remains a bottleneck, as it is under centralized control. The DNS is the weakest Internet protocol from a security perspective; it's easy to shut it down and this is enough to destroy over 90% of the Internet we know. In the particular case of WikiLeaks - because most of the GTLD are (still) subject to indirect control by the U.S. MD, domains can be quickly stopped (as we saw with WikiLeaks), without the need to prove that the activity on the site is illegal. If the domain name is suspended, most customers will not be able to connect to web servers, which significantly reduces the availability.
The weakness of the DNS is a weapon in the hands of those who want to circumvent court proceedings and to rely on an authoritarian mechanism that imposes a fast decision. That's why the RIAA tried long, and finally unsuccessful, lobbying on the telecom package of the EU and the DMCA. In the US, ACTA eliminated mandatory ISP traffic blocking and content providers' assistance without a court decision. The attempts to convince ISPs "to decide themselves" to be cooperative were also unsuccessful at a global level. These, along with new problems raised by recent movements promoting net neutrality in the US and the EU (they indirectly claim that a service must be declared illicit by a legal decision before it is blocked), motivated the RIAA to focus on GTLDs, with the idea of making them remove domains following a simple complaint, without waiting for a court decision (an extremely interesting ongoing case is the one with rapidshare.com).
In the particular case of my client, I address the DNS problem in 3 ways:
- I assume WikiLeaks will work with a (really) independent TLD (as when they tried to migrate to the .ch domain). This makes it extremely difficult to stop the service before it is accused of a crime and without a judicial decision (and even then, the decision is subject to appeal under the international standards in Switzerland). There is no way this will happen by Hillary just saying "stop the domain."
- Local access to data if you install the client (if you do not have access to a server, just install the client and you already have a server).
- Multicast DNS, a newer extension of the DNS protocol that is, moreover, supported by many modern operating systems. The DNS query is sent to a multicast group on port 5353 (from then on, everything is the same) and many machines can answer locally as to who serves this domain. My client tries to bind to port 5353 for multicast DNS (even if something is already there), fixes the local hosts caches, and, if the Bonjour service (Apple's multicast DNS service) is found, reconfigures it. So even if the international DNS fails, the local one will continue to work and to return accurate information. As long as you have one "WikiLeaks" client in each network segment, there is no need for a GTLD or for control of information at this level. Spoofing and flooding attacks do not work (their effect is local at most), and the system becomes much more fault-tolerant and, above all, decentralized and uncontrolled by the authorities.
I have everything described here coded in a proof of concept Python-based torrent client and web server, with literally 300 to 400 lines of code. What I am going to build next is a nice GUI, which I will release as a demonstration client, not seeking anything other than to show the idea. I chose Python as there is a simple port of the libraries I use; programming is quick and relatively portable, and enables close cooperation with the operating system. Sorry, but Java is not my forte, and C++ would require more time to develop the concept. If I decide to write a real client, C++ will be my choice, but the concept should be easy to modify.
The idea itself is interesting. It allows wide distribution, fault tolerance and performance using relatively small resources (on home computers lent by volunteers).
One such infrastructure can support multiple sites and even applications, which can be downloaded as Python plug-ins and upgrades, distributed through the standard way of exchanging files and messages described above. With a standard API and framework defined, it would develop into a full cloud-distributed service, without a guarantee of performance (but also with no maintenance cost and significantly increased fault tolerance to random attacks and failures). In addition, individual clustering through an MPI API and decentralized distribution of web requests from one client to another can be developed further very easily, since each client can learn where its neighbors are and then, with a minimal protocol extension, learn their load and serve as a local distributor of requests. Through a structured algorithm (similar to DHT), these applications, even with local redirection of requests, can get globally controlled synchronization, and thus get all the services that a typical cloud infrastructure offers today, but one step further.
Such infrastructure could also serve torrent sites like thepiratebay or arenabg, so they will become resilient to being stopped (as infrastructure). This can be achieved without significant modification and with a very limited effect on the way the services are operating.
The practical impossibility to earn from advertising (not through my client, at least) by respective organizations, due to the difficulty in creating content and user dependent Ads (constraints of the architecture and striving for anonymity), would be a disadvantage, and hence there is no potential for commercial interests, unless the clients are modified so that they can be used as some form of distribution of ads.
In any case, such systems offer a fairly large extension of the concept and of our vision of what exactly constitutes an infrastructure and how (and whether) it can be controlled at all. This idea demonstrates that technology allows (and therefore does not allow) imposing views authoritatively, by relying only on maintenance of the machines. You cannot filter something easily, blocking or stopping any server, because you just don’t like it. If there are people who are interested in this service, it will exist. If the "cause" is "wrong and immoral", people should be first convinced that this is the case, and the need of applying force will be minimized, because no force will work, under any form, if there is a mass support for the "cause".
Saturday, 28 May 2011
Decentralized Infrastructure for Wikileaks