Contents
Introduction
The method of selecting which peer(s) to get a file from affects how quickly file contents can be delivered to processes interested in reading those files.
If you're not already familiar with the design, it may help to start by reading (or at least skimming) the description of the peer to peer design.
Previous Approaches
The peer selection algorithm has evolved with the prototype through two different methods.
Random
The first peer selection method was totally random. When the tracker server provides a list of peers with a copy of the file, peers were picked at random from that list.
The concept behind this was that by spreading out the work evenly among all the peers, it would avoid directing too many requests to particular peers.
The obvious flaw is that this made no attempt to select a peer based on latency, bandwidth, or network locality.
IP Difference
The second peer selection method was to subtract each peer's IP address to the local machine's IP address and treat the absolute value of the difference like a distance. The available peers would be sorted by the IP difference, and the peers would be tried starting with the ones with the lowest difference.
For example, suppose the local machine's IP address was 192.168.1.10 and the available peers were 192.168.1.5 through 192.168.1.15. They would be tried in the following order:
192.168.1.9 192.168.1.11 192.168.1.8 192.168.1.12 192.168.1.7 192.168.1.13 192.168.1.6 192.168.1.14 192.168.1.5 192.168.1.15
This method has a much greater tendency than random selection to select peers which are closer on the network. For example, in most cases machines on the same local subnet (i.e. matching the netmask of a local network interface, reachable without passing through a router) will have a lower IP difference than machines on a different subnet.
One case in which this method performs less than optimally is when there are two separate subnets which have adjacent IP ranges. Suppose there are file caching agents on the two networks 10.12.34.0/24 and 10.12.35.0/24. An agent on a machine with the address 10.12.35.10 will consider 10.12.34.250 to be closer than 10.12.35.100.
Another performance problem with this method is that machines in the middle of the IP range will receive more traffic than those at the ends of the IP range. Suppose the file caching agents have addresses from 192.168.1.1 through 192.168.1.99. The agent with the addresses 192.168.1.50 will receive the most requests, because the 48 agents with higher addresses will try it before any with a lower address and the 48 agents with lower addresses will try it before any with a higher address. A potentially even worse case is when multiple subnets are involved, connected by a router with limited bandwidth. Suppose there are file caching agents in the three networks 10.1.20.0/24, 10.1.30.0/24, and 10.1.40.0/24. The agents in 10.1.20.0/24 will try agents in 10.1.30.0/24 before agents in 10.1.40.0/24. Similarly, the agents 10.1.40.0/24 will try agents in 10.1.30.0/24 before agents in 10.1.40.0/24. If the three networks are all connected to a router by links of the same capacity, the connection to 10.1.30.0/24 will get the most peer-to-peer traffic.
Considerations
There are multiple issues to consider in the design of a method for selecting peers.
Latency
When making a request to a peer, how long does it take for a response to arrive? Ultimately, the goal is to deliver file contents to processes reading the files as quickly as possible. The ideal algorithm would select peers with the lowest latency.
Network locality is obviously a factor in latency. The more layers of network switching and routing a packet must pass through, the longer it will take to reach its destination.
Network capacity can also affect latency. If a network link becomes saturated (by more packets trying to cross the link at one time than is possible), some packets will be delayed.
The latency of individual transactions can be measured fairly accurately. However, latency can change due to temporary conditions both in the network and on the peer machine processing the request. Neither overall average latency nor the latency of recent requests is necessarily a good predictor of latency of the next request.
Bandwidth / Capacity
Networks never have uniform capacity for the transmission of data between all hosts. Between any pair of hosts, the maximum capacity is limited by the minimum capacity of any link which must be traversed. However the available capacity between any pair of hosts is also limited by the amount of traffic between other hosts which traverse the same links.
Unfortunately, network capacity is very difficult for a program to measure directly. An estimate of capacity could be computed by dividing the quantity of data transmitted by the time taken to receive a response for each request for data. However its not clear that such an estimate would ever be a better basis for choosing a peer than latency alone.
Relevant to this discussion, ScottVenier pointed out the now classic essay It's the Latency, Stupid. KenSchalk also found the more recent Its Still the Latency, Stupid.
Locality
Networks can usually be viewed as a tree of switches and routers. (The reality is occasionally a little more complicated.) Requests from one host to another go up through the tree of switches/routers and then back down on another path. Almost universally latency is lower and capacity is higher the fewer layers of this tree a request must pass through.
Network locality is also difficult for a program to measure. The easiest determination to make is whether a request to a particular peer will pass through a router or not. This can be determined based on the local address and netmask for each network interface. It is also possible to determine how many routers a request must pass through, though that is significantly more difficult.
Compared to routers, switches are largely invisible to programs. However, depending on the number of layers of network switches between individual machines and the first router, they can be significant for performance.
Fairness
Distributing work evenly throughout a network of peers generally improves overall performance. Directing more requests to some peers and less to other will generally degrade overall performance.
However it's important not to ignore other factors when trying to distribute work fairly. (See comments above about random peer selection.) It does make sense to distribute requests fairly among peers with similar locality / latency.
Planned New Approach
How will a new peer selection algorithm work? (This essentially describes a work-in-progress design.)
Comparison Methods
The choice between peers will be evaluated by a series of different methods. If the information for some methods is not be available (e.g. we have no records of latency because we haven't previously communicated with the peer), then the selection process will move on to other methods. Generally, this moves from a more dynamic assessment based on current conditions to a more static assessment based on long-term conditions or unchanging inputs.
Method 1: Peer-Specific Recent Pseudo-Average Latency
The latency of each request to a peer will be measured (i.e. the complete round-trip time from sending the request to receiving the full response). Peers will be compared based on a pseudo-average latency over "recent" requests. This will allow the selection process to avoid temporary problems like network congestion or competing jobs on a peer slowing its responses.
Rather than using a true average of recent requests, we'll keep a pseudo-average. When a new request to the peer completes, we'll take the average of the current pseudo-average and the new response time, making that the new pseudo-average. This produces an estimate that is biased towards the last response time, but doesn't completely ignore previous responses. The main advantage of this is that it avoids keeping a queue of records of recent response times, adding new items to the queue, removing old items from the queue, and averaging across the queue. The data structure is simple, and the computation work is small.
We could adjust the bias by changing the formula from this:
new_pseudo_average = (pseudo_average + new_time) / 2
To a form that gives greater weight to previous responses:
new_pseudo_average = ((pseudo_average * past_weight) + new_time) / (past_weight + 1)
It's unclear how to best choose the bias, so it will be configurable. Possible defaults for past_weight would be 2 to 4.
(ScottVenier suggested this pseudo-average algorithm.)
To avoid treating temporary conditions in the past as indicative of future behavior, records of requests past some time horizon will not be considered "recent." It's unclear how to size this time window, so it will be configurable. The default could be set at 1 minute.
When the last request recorded in the recent latency pseudo-average is too far in the past, the selection process will fall back on the next method of comparing peers. When computing a new pseudo-average based on the latency of a new operation in this case, it will use the overall average latency in place of the pseudo-average to compute a new pseudo-average.
Method 2: Peer-Specific Overall Average Latency
An overall average latency for each peer will be kept. This would not be sensitive to fluctuations in peer response time due to temporary conditions. The recent pseudo-average should be a better predictor until it gets out of date. The overall average should be a better predictor when we don't have any recent latency data.
Method 3: Network Neighborhood Overall Average Latency
Peers will be grouped into "network neighborhoods," and the average latency for each neighborhood will also be recorded. If we don't have any latency data for a peer, we can use the average for its neighborhood instead. In general, that should be a good predictor of latency.
One method of grouping peers would be to apply the local machine's netmask to each peers address. This may not accurately represent network topology, but is a reasonable guess.
To allow for better representation of network topology, there will be a configuration variable where administrators can provide one or more CIDR specifications of network neighborhoods. If a peer's address doesn't match any of those, we'll fall back on using our local netmask.
Method 4: Local Network or not
In the absence of any latency data to evaluate the relative cost of two different peers, we would check whether each is on a locally connected network. In other words, we would see if it matches based on the address and netmask of the local machine. (Take the bitwise AND of the netmask with the local IP address and the peer's IP address, and see whether the results match.) A peer on the local network would be chosen before a peer on a non-local network.
Method 5: Number of Matching Leading IP Bits
For peers that are not on the local network and for which no relevant latency data is available, their relative cost will be evaluated based on the number of leading bits of the IP address which matches the local host. The more leading IP bits match, the closer a peer is (and therefore the faster it is likely to respond). The fewer leading IP bits match, the further away a peer is (and therefore the slower it is likely to respond). (Obviously any machine on the local network matches at least as many bits as the netmask.)
To avoid artificially ordering peers that are really equidistant, not every bit will be considered significant. For example, consider the following peer subnets all connected to the same router:
10.123.45.0/24 10.123.46.0/24 10.123.47.0/24 10.123.48.0/24 10.123.49.0/24 10.123.50.0/24
If all of these subnets are considered together, they have only 19 leading bits in common. However some of them have 20 leading bits in common with each other. If this is considered significant, then the first three subnets would treat each other as "closer" than the last three subnets and vice versa. That kind of artificial clustering would lead to an unfair distribution of requests between peers and could cause network bottlenecks.
It's impossible to know how different numbers of leading bits map to real-world network topology. In a large corporate Intranet using the class A private range (10.0.0.0/8), it could be that 16 leading bits of the IP will always match for machines in the same building and 24 leading bits of the IP forms the local subnet. Or for smaller networks it could be that 24 bits match for machines in the same building and the local netmask of each machine has 28 leading bits set.
Since different users may have different network architectures, the IP matching "distance" categories should be configurable as a list of numbers of matching leading bits. The default could be to only consider matches of 8, 16, or 24 leading bits. An administrator could change this with the configuration setting to 16, 20, 24, and 28 leading bits.
Other Selection Details
Comparing Peers on Local Network
Suppose there are several peers available, all on the local network, and no latency data is available for those peers. They would be considered equi-distant, so how should a choice be made between them?
Random selection is one option. Sorting them by IP address would be another. It's not clear whether either would have any benefit.
It probably doesn't matter very much, as once an agent begins communicating with peers it will quickly acquire latency data which will allow future selection to be more precise.
Randomize Amongst Peer Networks
For the case of multiple peers on non-local networks for which no latency data is available, it seems wisest to choose randomly. The network topology and capacity between the local machine and such peers may be asymmetric. Choosing randomly should make it less likely to accidentally prefer a set of peers that will all respond slowly. Also, it ensures fairness among the peer networks.
Sorting Peers
By using the different comparison methods outlined above, a list of peers could be sorted. That would provide a total order, allowing the selection of peers to simply take elements from the beginning of the list.
However in some cases the comparison methods will not provide any relative ordering between some peers. They would essentially be considered equi-distant from the local machine, or to have an equal estimated latency or cost. As described in the previous sub-section, it's probably best to pick randomly amongst peers with an identical estimated cost. To allow that kind of random picking among peers of equal cost, we need a partial order rather than a total order.
The data structure into which available peers are "sorted" should therefore be a list of lists. Each element of the outer list will be a list of peers considered to have an equal cost.
In some cases, it may be difficult to place peers into these different cost "buckets." Suppose we have four peers, A, B, C, and D. None are on the local network. All have the same number of matching leading IP bits, so would be considered to have an equal cost by that measure. However, latency data is available for A and C. A has lower latency than C, so would have a lower cost. Should B and D be grouped with A or with C? Or should B and D be grouped after C or before A?
If we group B and D with C, A will be preferred over either B and D. That doesn't seem like the best choice, as B and/or D could turn out to have a lower actual latency than A.
Similarly, it doesn't seem wise to group B and D after C (and thus also after A), as that would make it even less likely that they would ever be contacted.
Grouping B and D after A would still prefer A over B and D. That could be bad if B and D have lower latency than A (or even equivalent latency to A).
Grouping B and D before A would preferentially choose peers we have never contacted before. That could be bad if B and D have higher latency than A, but after communicating with them once we would have latency data that would select A next time.
Grouping B and D with A (and selecting from them randomly) would avoid preferring peers we have never contacted before over peers which have had low latency in the past. However it would make it likely that we would select a host we haven't communicated with before and thus build up latency data bout it.
Probably the final option (grouping B and D with A) will be used.
What if the Server is Faster?
There may be some peers which exhibit higher latency than the tracker (server). That could be caused by network asymmetries, network congestion, or just peers that are very far away.
When trying to select a peer, if there is reason to believe that none of the available peers will respond faster than the tracker it makes sense to go to the tracker rather than a peer. This is already supported by the network protocol for cases where peers can't be reached or don't have the file in question. (The latter can happen due to a race between the tracker sending a list of peers, and the peers on that list deciding to drop the file to reclaim disk space.) As a side benefit, the agent will get an updated list of peers, some of which may turn out to be faster.
To support this determination, we will need to keep an all-time latency average for the tracker as well as a recent latency pseudo-average. To be sure we compare them fairly, these latencies for the server will only be recorded for the analogous single-block request, not for other kinds of requests which may have different latency characteristics.
Simple Example
For the sake of this discussion, imagine there are just two peers available, which we'll call PeerA and PeerB.
First, a latency value for each peer would be selected. In order we would check for:
- Peer-specific recent pseudo-average latency that was last updated within the specified time window.
- Peer-specific overall average latency.
- Network neighborhood overall average latency.
It doesn't matter which kind of latency we select for each peer, as the values will all have the same units. If the latency value for PeerA is lower than the latency value for PeerB, PeerA would be preferred.
If we cannot determine a latency value for either peer, we would then check whether each peer is on a local network. If PeerA can be reached without going through a router and PeerB cannot, PeerA would be preferred.
If neither peer is on a local network, their distance will be estimated based on the number of leading bits in their IP addresses which match the local host's IP address. If PeerA has 16 leading IP address bits matching the local host and PeerB has only 8 leading IP address bits matching the local host, PeerA would be preferred.
Detailed Example
For this example, we'll consider the decisions of a caching agent running on a host with the IP address 10.12.34.56 and the netmask 255.255.255.0 (i.e. it's local network is 10.12.34.0/24).
Suppose that the tracker has given it a set of twelve peers that have a copy of the file that it's interested. We'll call thhose peers PeerA, PeerB, and so on through PeerL. Those peers have the following IP addresses:
PeerA |
10.12.34.45 |
PeerB |
10.12.34.67 |
PeerC |
10.12.34.78 |
PeerD |
10.12.23.45 |
PeerE |
10.12.23.56 |
PeerF |
10.12.45.67 |
PeerG |
10.12.45.78 |
PeerH |
10.12.56.78 |
PeerI |
10.12.56.89 |
PeerJ |
10.11.23.45 |
PeerK |
10.11.23.45 |
PeerL |
10.10.34.56 |
Let's note a few more things about these peers:
Host |
Local |
Neighborhood |
Matching IP Bits |
PeerA |
Yes |
10.12.34.0/24 |
24 |
PeerB |
Yes |
10.12.34.0/24 |
24 |
PeerC |
Yes |
10.12.34.0/24 |
24 |
PeerD |
No |
10.12.23.0/24 |
16 |
PeerE |
No |
10.12.23.0/24 |
16 |
PeerF |
No |
10.12.45.0/24 |
16 |
PeerG |
No |
10.12.45.0/24 |
16 |
PeerH |
No |
10.12.56.0/24 |
16 |
PeerI |
No |
10.12.56.0/24 |
16 |
PeerJ |
No |
10.11.23.0/24 |
8 |
PeerK |
No |
10.11.23.0/24 |
8 |
PeerL |
No |
10.10.34.0/24 |
8 |
The network topology might look like this:

Before it begins retrieving the file, the caching agent has these latency records from previous interactions with peers:
Where |
Recent Latency |
Average Latency |
PeerA |
2ms |
1ms |
PeerD |
none |
4ms |
PeerF |
5ms |
4ms |
PeerJ |
none |
10ms |
10.12.34.0/24 |
N/A |
1.5ms |
10.12.23.0/24 |
N/A |
4.5ms |
10.12.45.0/24 |
N/A |
5.5ms |
10.11.23.0/24 |
N/A |
11ms |
(Note: These latency numbers are not based on any real measurements. They are simply meant to be easy to read and understand.)
In order to make its first selection of peers to contact for retrieving the file, the agent will order the peers as follows:
PeerB and PeerC
- Estimated latency: 1.5ms
- Based on latency data for the neighborhood
- Estimated latency: 1.5ms
PeerA
- Estimated latency: 2ms
- Based on recent activity
- Estimated latency: 2ms
PeerD, PeerH, and PeerI
PeerD has an estimated latency: 4ms
- Overall average
PeerH, and PeerI have no latency estimate available, but match the same number of leading IP bits as PeerD
- See "Sorting Peers" above for why peers with no latency data are handled this way
PeerE
- Estimated latency: 4.5ms
- Based on latency data for the neighborhood
- Estimated latency: 4.5ms
PeerF
- Estimated latency: 5ms
- Based on recent activity
- Estimated latency: 5ms
PeerG
- Estimated latency: 5.5ms
- Based on latency data for the neighborhood
- Estimated latency: 5.5ms
PeerJ
- Estimated latency: 10ms
- Overall average
- Estimated latency: 10ms
PeerK and PeerL
PeerK has an estimated latency: 11ms
- Based on latency data for the neighborhood
PeerL has no latency estimate available, but matches the same number of leading IP bits as PeerK
Assume the agent is configured to contact five peers in parallel (the default). It will clearly select PeerB, PeerC, and PeerA. It will then randomly select two of PeerD, PeerH, and PeerI. For this example it will pick PeerD and PeerH. The latency of those five parallel requests are:
PeerA |
1ms |
PeerB |
2ms |
PeerC |
2.5ms |
PeerD |
3.5ms |
PeerH |
4ms |
Recalling the discussion above about how the recent pseudo-averages will be computed and assuming past_weight is set to 2, we have this new latency data:
Where |
Recent Latency |
Average Latency |
PeerA |
1.67ms |
1ms |
PeerB |
2ms |
2ms |
PeerC |
2.5ms |
2.5ms |
PeerD |
3.83ms |
3.99ms |
PeerF |
5ms |
4ms |
PeerH |
4ms |
4ms |
PeerJ |
none |
10ms |
10.12.34.0/24 |
N/A |
1.51ms |
10.12.23.0/24 |
N/A |
4.49ms |
10.12.45.0/24 |
N/A |
5.5ms |
10.11.23.0/24 |
N/A |
11ms |
10.12.56.0/24 |
N/A |
4ms |
In order to make its second selection of peers to contact for retrieving the file, the agent will order the peers as follows:
PeerA
- Estimated latency: 1.67ms
PeerB
- Estimated latency: 2ms
PeerC
- Estimated latency: 2.5ms
PeerD
- Estimated latency: 3.83ms
PeerH and PeerI
- Estimated latency: 4ms
PeerI is from recent latency
PeerH is from its neighborhood's latency
- Estimated latency: 4ms
PeerE
- Estimated latency: 4.49ms
PeerF
- Estimated latency: 5ms
PeerG
- Estimated latency: 5.5ms
PeerJ
- Estimated latency: 10ms
PeerK and PeerL
PeerK has an estimated latency: 11ms
PeerL has no latency estimate available, but matches the same number of leading IP bits as PeerK
So the second five peers selected will be: PeerA, PeerB, PeerC, PeerD and either PeerH or PeerI.
Implementation Notes
I had planned on keeping a recent pseudo-average for each peer the agent had communicated in the past and for the server, in addition to all-time averages for both. I had not planned to keep a recent pseudo-average for each network neighborhood. My thinking was that short-term changes in latency were more likely for individual hosts than for entire network neighborhoods, though that's a fuzzy argument and may not be true.
The implementation is making it simpler to also keep a recent pseudo-average for each network neighborhood. I can't convince myself there's a good reason to avoid that, so I'm just going to keep the neighborhood recent pseudo-averages. However, it does seem like the past_weight setting should be higher for network neighborhoods than individual peers (to keep single slow peers from quickly dragging down the entire neighborhood's pseudo-average). They'll be separately configurable, and the neighborhood past_weight will default to twice the peer past_weight.