Plenary session
25 October 2016
5 p.m.

BRIAN NISBET: We're going to begin. Right. Hello and welcome to the Tuesday afternoon Plenary slot. This is the last slot of the day, as you can all tell from your programmes. We do have a slight change in the agenda in that we'll move one of the talks around, and you will figure it out quickly.

First of all, could I ask the four RIPE PC nomination candidates to come up on stage. I see three of them. Is the fourth one there? No... no... okay...

So, again, the RIPE PC runs on volunteering, as the entire community runs on volunteering, so, from the point of view of the PC, we are very thankful for anyone putting themselves forward and for all the hard work that everybody does. Opportunities to get on the PC come up very regularly, so if you are sitting there saying, damn, I should have nominated myself earlier today, you will have a chance at the next RIPE meeting.

We will be sending out the URL for the web page that has the bios of all four of our candidates. The voting will open very shortly after this session and run until 17:30 on Thursday afternoon, and then the results will be announced on Friday morning.

Right now, what we're going to do is each one of the candidates here will get a minute to introduce themselves, just to tell you who they are. I'm going to start with Falk.

SPEAKER: My name is Falk von Bornstaedt. I have worked for Deutsche Telekom for 25 years, and I have been on the Internet for 20 years now, which was quite a lonely thing at the beginning. For the last ten years I have been attending RIPE meetings, and I would like to give something back because I have received so many interesting things here. I would like to do more than hosting a RIPE Atlas probe on my private DSL and do a little bit more for the community.

SPEAKER: I am Leslie Carr, I work at Clover Health and SFMIX. I have had the honour and privilege to be on the Programme Committee for the last two years. I think that I, along with my fellow committee members, have, you know, been really lucky to have had very excellent Plenary sessions, and I would like to continue my work as a part of the Programme Committee.

SPEAKER: Hello, I am Maria, I work for a non-profit organisation, a consortium of universities in Catalonia, where we manage the regional network and the Internet exchange, so I have contact both with the research and the ISP people and also the IXP members, and I would like to help with the view from the south of Europe.

SPEAKER: Hello, this is Dmitry Kohmanyuk from Ukraine. I have been to many RIPE meetings over the last five years, and I have also served on the ENOG (Eastern Europe) Programme Committee, so I'd like to try myself at this new thing and help to improve the content.

BRIAN NISBET: Thank you all very much, and indeed, thank you for volunteering. Importantly, I would like to note that the other outgoing PC member, Marcus, who is chairing this session with me, will not be standing again, so I'd like to thank him very much for the work that he has done over the last two years.


So, now, we can move on with the rest of the meeting. This is the change. So I'd like to invite Elise up on stage please. There are slides...

ELISE GERICH: My name is Elise Gerich and I have been attending RIPE meetings for a very long time wearing many different hats, and today I'm wearing yet another hat: the PTI hat instead of the ICANN hat. So one of the things I'd like to do today is give you an introduction to PTI and then provide the IANA services update.

So, PTI stands for Public Technical Identifiers, just in case you didn't know, but I'm going to go back a bit. There was a request from the US Government Department of Commerce for a proposal from the multistakeholder community, so this slide represents the three communities that got together to put a proposal to the US Government. The proposals were then combined, and you had many, many representatives from the RIRs who worked very hard on the CRISP team to put together the proposal for the numbers community.

So, when all those proposals were put together and combined, you get this wonderful schematic, which basically shows you how each of the communities (numbers, names, protocol parameters) wanted their relationship with the IANA services in ICANN to work.

So, one of the communities, the names community, proposed that there needed to be a separation between policy and operations, and they were the only ones that actually had that issue, but the RIRs and the IETF decided to go along and create a new entity, which we're calling PTI. During the proposal phase it was called Post-Transition IANA, but that name didn't fly when we went to incorporate, so it turned into Public Technical Identifiers.

So this is kind of a big-picture overview of what PTI is. There's the legal part: you have to be incorporated, and you are a 501(c)(3), which means you are a non-profit in the United States. On the organisation side, you have a PTI board, which has been appointed. You have officers, because in order to incorporate a new affiliate or new company you have to have officers, so I was named president. I am now President of PTI, and the two other officers of the company are Becky Nash, the treasurer, and Samantha Eisner, the secretary. All the staff of PTI are seconded, or lent, from ICANN to PTI, so we're all on loan for the time being to form this new entity.

So what does this mean to the RIRs? That's what I'm sure you all care about. The RIRs have a contract with ICANN to provide the IANA services, the services we have been providing for quite a long time: allocation of IPv4 addresses (obviously we only have the recovered pool now, so those are the only IPv4 addresses we hand out); IPv6 addresses, of which we haven't handed out any since I have been at ICANN, which is six years, but I hear tell that we might sometime in the near future; and autonomous system numbers. So those are primarily the services we offer to the RIR community.

So, the RIRs have a contract with ICANN. ICANN sub‑contracts to PTI, which is the operational side of the house, and then ICANN is the one that's accountable to the RIRs. So, there is a contractual commitment all the way through.

So, the output is pretty much the same as it has been in the past. There'll be standard reports generated based on what allocations are made to the RIR communities, and there'll be monthly reports posted.

Now, this chart is something that's slightly new and different for the organisation. In the past, the IANA services budget was just part of the overall ICANN budget. Now, since we're an affiliate, we have our own budget which, based on the proposal that went to the US Government, has to be done nine months in advance of ICANN's budget. So, we have been working on this for a bit of time, and a public comment was posted just yesterday for the PTI budget, which will be a component of ICANN's budget, because PTI has no revenue: it's a cost centre, and all the money it gets comes from ICANN, and so we have to submit our budget to ICANN, and then ICANN approves the budget when the community does.

So, if you are interested in how much it costs to provide the IANA services: this year, 10 million in funding is planned for the IANA services. That includes the direct employees who work for PTI. It also includes the shared resources from ICANN that we need, such as HR, legal, facilities, things of that nature.

So, people have asked me, well, how do I get in touch with PTI now if I need something? How do the various RIRs reach us? Basically, PTI is just an affiliate that houses the IANA services, and you contact us the same way you did in the past. You can just send your questions to IANA at, or you can go to our website, or you can reach us via our e-mail addresses; that one is mine, elise [dot] gerich [at] iana [dot] org or elisegerich [at] icann [dot] org. We have both, because if you just have the old one plugged into your laptop, you can still reach us.

We don't have a address because someone already had that domain name. They also had and We looked at some of the new gTLDs, some of them that people thought were,, ptiglobal, but decided to stick with, it's shorter.

So that's my short introduction to PTI and hopefully you'll now understand what that means when you hear that PTI is providing the IANA services. And now for my short IANA services update.

So, the IPv4 recovered pool allocation happens twice annually, that's based on the global policy that was passed by all the RIRs. Our last allocation was made September 1st. Each RIR received an equivalent of a /18, which is not a whole lot of individual IP addresses. The next allocation will be March 1st. I have put the URLs up here and in the slide deck if you want to see the programme that's run every time against the recovered pool. You can run it yourself to see what RIPE and the other RIRs will receive next time.
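
To make the arithmetic concrete, here is a small sketch, not IANA's actual code, of how an equal five-way split of the recovered pool works; the pool size used is purely illustrative:

```python
def per_rir_share(pool_slash24s: int, rirs: int = 5) -> int:
    """Each RIR receives an equal share of the recovered pool,
    rounded down to whole /24 units (the minimum allocation unit
    under the global policy)."""
    return pool_slash24s // rirs

# A /18 is 2**(24 - 18) = 64 /24s, so a pool of roughly 320 /24s
# yields the equivalent of a /18 per RIR.
share = per_rir_share(324)
print(share, "x /24 per RIR")
```

Because the division always rounds down to whole allocation units, each round hands out slightly less than the pool shrinks by, which is why the per-RIR pieces get smaller and smaller until the pool runs out.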

This is just a screen‑shot. The IP address blocks in red are the ones that RIPE received this time as part of the allocation on September 1st.

And if you're curious, the way the global policy is written, we keep handing them out based on the policy until we run out of them. And right now, based on receiving no more IPv4 addresses into the recovered pool, we'll run out of them in 2019. But, everybody gets just smaller and smaller and smaller little bits between now and then. So, as long as the pool stays the same size as it is, that's when the last allocation will happen.

So, I want to thank you for your attention. And this is a photo of all the staff who have been doing the IANA services for many years. Michelle Cotton, who is in the picture, has been with the IANA services for 16 years. Two of our staff members have been there for 11, and the rest of us have been there between two and six years. So we look forward to continuing to provide the service that meets your needs, and thank you very much for your time. If you have any questions, I'd be happy to take them.


CHAIR: Okay, we have time for one question. Okay. Then, thank you very much.


CHAIR: So, we'll continue with your regular programme now, and our first speaker is Ricardo, who will look into large-scale Internet outages.

RICARDO OLIVEIRA: Hi everybody. I am the CTO and co-founder of ThousandEyes, and I am here to talk a little bit about a capability we have called Internet outage detection, and to go over some examples of outages we have detected in the recent past. This is a way to do outage detection on the Internet that is a bit more structured, or automated, rather than, you know, receiving e-mails on mailing lists and crowdsourcing outages and so on.

So, what is ThousandEyes? We are based in San Francisco, and we basically measure network performance using an infrastructure of agents. Just to give you some idea, we have about 125 locations worldwide of Cloud agents. These Cloud agents are managed by us, they are in major metros right now, and we have a process of vetting these agents, making sure they don't bounce across 20 hops until they get to their destination application; they are usually close to a major service provider, either a major national backbone or a tier 1 provider. So they are fairly stable. Besides this network of Cloud agents, we also have about 300-plus locations with enterprises, and these are customers that download and install agents inside their own environments, mostly medium- to large-sized enterprise networks.

So, what data do we collect from these agents? We mostly collect application data, including HTTP data, DNS and, more recently, voice. These agents run what we call tests, which can be, for example, an HTTP GET, and they do this periodically, and then we keep different metrics depending on the application we're talking about. We also do real browser tests using Chrome, so you have a more comprehensive application view. And then, at the same time as we do the application test, we also do network-level tests, so we measure things like end-to-end delay, jitter and loss between the agent and the application being targeted, and we do this not by using ICMP but by using traffic which looks like application traffic from the point of view of the network and the application, but is actually our own traffic; it's like background traffic that gives us this information. That also includes the path information, which is like a trace route, if you will, where you send TTL-limited packets and then we try to reconstruct the paths that the application packets actually take.

We also collect routing information, mainly from RouteViews; we used to use RIPE data as well, and we might use that again in the future. The reason we use RouteViews is because of the intervals they have; it's more consistent with our data. And we have about 350 or so routing tables from different peers.

So, at a high level there are two types of outages that we classify. There is traffic outage detection, which is based on the data plane: every time you have heavy packet loss that, you know, is across a certain service provider, we're able to detect that and classify it as a traffic outage, and I'll go into more detail on how we do that in the next slide. The second type is the routing outage; these are mostly control plane reachability outages, where we see a significant portion of prefixes being removed from routing tables in a certain geography. We detect that and, you know, signal it as an outage event. An outage can be anything: if you look at a traffic outage, for example, it can be anything from an event of five interfaces to something affecting 1,000 interfaces. So, we basically have a very minimal threshold for what is an outage, but then we provide the context of how big the outage is in terms of interfaces affected as well as locations.
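
As a rough illustration of that classification (a minimal interface threshold plus size context), here is a hypothetical sketch; the threshold and data shapes are assumptions, not ThousandEyes' actual algorithm:

```python
from collections import defaultdict

def detect_traffic_outages(lossy_hops, min_interfaces=5):
    """Group lossy router interfaces by service provider and flag an
    outage event once enough distinct interfaces are dropping packets.
    `lossy_hops` is an iterable of (provider, interface, location)."""
    seen = defaultdict(lambda: (set(), set()))
    for provider, interface, location in lossy_hops:
        ifaces, locations = seen[provider]
        ifaces.add(interface)
        locations.add(location)
    # Report size context: how many interfaces and locations are involved.
    return {
        provider: {"interfaces": len(ifaces), "locations": len(locations)}
        for provider, (ifaces, locations) in seen.items()
        if len(ifaces) >= min_interfaces
    }
```

A provider with only one or two lossy interfaces stays below the threshold and is treated as background noise rather than an outage event.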

So, how do we do this for the traffic, data plane, case? This is an example where you have the agents here on the left side targeting applications on the right side, and each node there is a layer 3 interface of a router, typically. What we do is a superimposition of all these tests over time, across all our customer base and across all the applications they target, and then, as we see patterns of loss in the middle of the network around certain service providers, we are able to detect outages. Of course, one of the big challenges here is to remove all the noise, because there are always devices losing packets and it becomes almost like background noise all the time, so we have algorithms that make sure we filter those cases. The other thing is, we also exclude the edges from the outage detection: once we get into the enterprise network side, we exclude that, because firewalls and load balancers and proxies and all that stuff are typically very noisy and drop packets, sometimes randomly. We just don't look at that.

So, this is an example of an outage, in this case in Telia, so you get an idea of what the visualisation looks like. We have three main pieces of information here. We have what we call the affected tests here, the number of affected tests, and this is across our entire customer base, so we know there were 36 tests affected by this outage, and typically each test maps to a different application.

We know there were five locations, or five points of presence, and also 126 router interfaces involved in the outage. The other piece of information we provide here is what we call the terminal interface context: every time an interface is seen as a last hop in a path, we are able to tell how many of the tests are actually terminating at that interface, as well as, you know, the fraction of tests that are not affected by the interface. So this answers the question: am I the only one affected by this outage, or is someone else also affected, or was part of the traffic not affected? So this helps answer some of those questions.

This is an example of a routing outage. In this case, the root cause was in Hurricane Electric, so you have information such as the number of origins involved here (these are AS origins) and the number of affected prefixes in the outage. You have the location, United States. What happened here was that Hurricane Electric was the root cause of an outage affecting 171 prefixes, which disappeared from routing tables for a short period of time, and that created reachability issues. In the graph you see here on the bottom, this map here, the diamond-shaped nodes are what we call monitors, which are the RouteViews peers, or routers, and then each circle node here is an autonomous system. So you see the connectivity between autonomous systems. And you also see the green node there; that's the origin, basically, that is advertising the prefix, or the prefixes.

So, after we started this project, we started collecting data and triggering outage events, and we did some stats, and what we realised is that outages happen all the time; it's actually a fact. So not only data plane outages but also routing outages are almost like the normal state of the Internet. You see some spikes on the top graph. And we know that, you know, sometimes there are outages that affect a big fraction of the Internet, and at other times there is just a very local, regional outage that impacts only a small population of users.

These are just some of examples of outages that we saw recently. I'm actually going to skip this.

And I'm just going to go over some examples of the outages that I have here; I have four. The first one is a high-impact traffic outage in Telia, which affected more than 120 interfaces across five locations. You see the epicentre being located in Ashburn, Virginia and also London. So most likely it was somewhere in a connection between Ashburn and London; something happened there. We don't know exactly what, or at least I don't know. This is how it looks in the path visualisation. You can see the agents on the left side, the two orange nodes, and the red circle nodes are the interfaces where the packets are being dropped, basically, and they are all in the Telia network. This outage, in fact, affected a bunch of customers; you can see it was close to 58 in total here, so it was a very, you know, widespread outage.

The second example is a combination of a routing and a traffic outage, involving Hurricane Electric. The left image is before the outage happened, so you see all the monitors are green in terms of reachability; there are no issues there. Then, on the right side, you see some of the routes being withdrawn and the monitors becoming orange and red as they lose reachability to Hurricane Electric; they actually flap between the two.

What that created was also some loss in the data plane, so there was some packet loss in the middle of the network, and you can see the nodes here are actually inside Hurricane Electric, and this also triggered a data plane outage. So this is the case of a routing outage that also triggered a data plane outage.

The third case I have here involved an application that some of you might actually use, called JIRA. If you look at the graph on the top left, we show the HTTP availability; there is a drop there for some time, meaning that the HTTP tests were failing. The graph below covers the same time window but looks at end-to-end layer 3 loss; you see a spike there close to 100%. You see the outage context on the right side: it seems there was some issue around Ashburn, and that affected a certain number of interfaces as well, not as big as the previous one I showed.

What is interesting here is that there was also a routing outage involved in this one. JIRA is served from a /24, and that became unavailable, while there was a /16 covering the /24 which, from the control plane perspective, was fine; it was reachable. But it was announced by NTT, not Atlassian, so traffic started to go to NTT, and NTT basically didn't have anywhere to send it; it was basically dropping the packets there. So, on the left-hand side here is the case where the application is working normally, with the /24 being announced by Atlassian properly; on the right side you see the packets going to NTT and being dropped at NTT.
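
The JIRA incident comes down to longest-prefix matching: once the /24 was withdrawn, the covering /16 attracted the traffic. Here is a minimal sketch of that rule; the prefixes below are documentation examples, not the real Atlassian or NTT ones:

```python
import ipaddress

def best_route(routes, destination):
    """Longest-prefix match: among covering routes, the most
    specific prefix wins. `routes` is a list of (prefix, next_hop)."""
    dest = ipaddress.ip_address(destination)
    matches = [(prefix, next_hop) for prefix, next_hop in routes
               if dest in ipaddress.ip_network(prefix)]
    if not matches:
        return None
    return max(matches,
               key=lambda m: ipaddress.ip_network(m[0]).prefixlen)[1]
```

With both routes present, the /24 wins; withdraw it, and the same destination suddenly resolves to the /16's origin, which is exactly how the traffic ended up at NTT.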

The last example I have is one of a cable cut. This happened sometime this year, and it started to affect some service providers, including Tata and TI Sparkle. You can see this is the normal behaviour when you do a path trace from here to here; everything seems to work fine. And then you start seeing the drops here: the red circles are appearing, meaning packets are being dropped inside Tata. And the same symptoms start to appear in TI Sparkle. And we see cases like Netflix, where traffic that usually would go through Paris and Marseilles was rerouted and now goes through Frankfurt in Germany. And if you look at that region in terms of what cables are connecting that European area to the Middle East, there are SEA-ME-WE 3 and 4, and both of those providers make use of that infrastructure. And this is the SEA-ME-WE topology in more detail; you see the connectivity here goes from southeast Asia to France.

And so that incident that started in Europe actually propagated to different regions, including South America, Buenos Aires, and also the US and other regions in Europe. The reason this also affected the South American continent is because there's actually a part of South America that relies on that cable to connect to Europe, and because of the outage, that created some reachability issues and the BGP sessions started to fail, and you start to see some more withdrawals. Eventually we had an acknowledgment from SEA-ME-WE 4 that it was caused by a faulty repeater on segment 4, which is between Cairo and Marseilles.

And that's basically where the problem happened.

So this is my last slide, and before I go to the questions, I just want to say that we plan to make this collection of outages we have been detecting available in a portal, especially the larger-scale ones, so that anyone could access it and see, at least from our perspective, what we considered an outage at a given time.

So if anyone has any questions...

BRIAN NISBET: Thank you very much.


Does anyone have any questions?

SPEAKER: Alexander Azimov. I have a question about the power of your outage detection. What do you do if you are not able to make traceroutes from both sides? How do you distinguish the router where the outage begins?

RICARDO OLIVEIRA: So your question is what happens if the trace route doesn't work from one side. The way we do the path trace, we start sending TTL-limited packets. The only case where we don't have any data on the trace route is if there is a name resolution issue, if the DNS server is not working.

SPEAKER: Okay. But most Internet paths are asymmetric, and when you are making a trace route, you see only one direction of the traffic. Maybe the drop is on the other side. So, how do you distinguish on which side of the traffic, forward or backward, the outage is?

RICARDO OLIVEIRA: Yes, so, obviously the problem can also be that you see the packets being dropped on the reverse path, and we don't have visibility into that. But because we have a minimum critical mass, we can point to a certain ISP with some extra level of confidence, although this is more of a hot spot; in some cases it may not give you the exact location of the outage. The case where that might happen in particular is if the loss happens at a border between two ISPs: we always attribute it to the ISP before, you know, where the drops started to happen. So if you have two ISPs, A and B, and something happens in the middle, we attribute the loss to A even though the problem is on B, because that was the last known hop. So that's the only case where the inference can, you know, not be 100% accurate, but we're working on an algorithm now to fix that, or mitigate it to some extent.
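
The last-known-hop attribution he describes can be sketched like this; it is a simplification, not the production algorithm, and it shows exactly why a drop at the A/B border gets pinned on A:

```python
def attribute_loss(path):
    """Attribute loss to the ISP of the last hop that responded
    before the drop. `path` is an ordered list of (isp, responded)
    pairs along the forward trace."""
    last_responding_isp = None
    for isp, responded in path:
        if responded:
            last_responding_isp = isp
        else:
            # First silent hop: blame the last ISP we heard from,
            # even if the real fault is on the far side of the border.
            return last_responding_isp
    return None  # no loss observed on this path
```

If A's last hop answers and B's first hop does not, the loss is attributed to A, which matches the inaccuracy the speaker acknowledges.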

SPEAKER: Okay. Thank you, I will be glad to hear how you are going to mitigate this problem. Thank you.

BRIAN NISBET: Any other questions? No? Okay. Thank you very much.


So, next up I'd like to introduce Richard Sheehan, production engineer who doesn't seem to be as blatant as his colleague earlier about who he works for, but it's rumoured that it's a large content provider that does cat pictures or something, I'm not entirely sure. So, Richard, thank you very much.

RICHARD SHEEHAN: Mostly cat pictures. My name is Richard Sheehan, I am a production engineer, which means I am an SRE at Facebook. I am going to talk about network fault detection at Facebook. There are four sections to this talk: firstly, network monitoring and why it's good to have it; secondly, why bigger networks need different ways of monitoring; thirdly, NetNORAD, which is the first system we developed at Facebook to try to find loss within our networks; and then I'll talk about NFI, which is the thing I have been working on for the last 12 months. It stands for network fault isolation.

Network monitoring.

As many of you know, if you live in a world where there is no network monitoring, it's a pretty miserable place, for one big reason: people tell you when the network is broken, and they don't tell you nicely. The worst part about it is that they tell you when they think the network is broken, which usually means it's actually not the network: they just pushed software and something broke, and they said, oh, it must be the network, because last time something broke it was the network, so every time something breaks it's the network. And most of us in this room have been there before. Ultimately, if you are a good network engineer, you have to investigate every one of these reports, every time, to see which it is. And that doesn't scale in a large-scale environment.

So, how does a network engineer investigate these things? You run pings and traceroutes. Ideally, you figure out where the packets are going missing and you fix it, whatever it happens to be.

So, this isn't actually a lot of fun on a day-to-day basis unless it's a really interesting problem. Data centre networks tend to break in the same ways: mostly you have run out of capacity, which is embarrassing, or you have a bad link or a bad device. So the first thing you want to do is build some graphs and monitor that data. There are two ways to get that data. Firstly, SNMP; quickly you realise, when you talk to your vendors, that they don't expose all the counters, and you end up with this. Or Python, depending on what you use these days. Then you build dashboards, and more dashboards, and your dashboards get bigger and bigger, and that's okay, except this whole process is fundamentally flawed, for one simple reason, and you get bitten by this in a large-scale environment: you are trusting the network device to tell you that it's broken, and if it's broken, you shouldn't trust it. So this comes down to one of my principles of monitoring a network: if you didn't design and build the hardware yourself, you absolutely shouldn't trust it to tell you it's okay. You say, don't you build your own switches now? You're damn right we do. And even if you build it yourself, you really can't trust it to tell you it's okay, because you know all the components that went into it. Most importantly, you know all the software that's gone into it, and you know how badly you write software and how the people you work with write software. Again, we never want to trust our equipment or software.

So, what do our networks look like? Data centre networks are what I'm going to talk about, which is the easier problem to solve. As we have talked about many times publicly, we like to build big Clos fabrics. I have a lovely big picture of what they look like here; it's very pretty. But it relies on something important. For the purpose of this talk, I'm going to use this diagram we have here. There is a host on the left, and there is a host on the right in the other data centre. We have a rack switch, cluster switches, spine switches, inter-data-centre switches, spine switches in the next data centre, and so on. Everything is going to be related back to this diagram. What makes these networks work is ECMP: equal cost multipath.

But to quickly explain it: when host 1 wants to send traffic to host 2, it sends it to a rack switch, which goes, I have got four uplinks to four different cluster switches; I have got to pick one. It looks inside the TCP/IP packet, takes these five fields (protocol, source address, destination address, source and destination port), hashes them, and uses that hash value to decide which link to take. So it's got four options and it's going to pick one.
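
That five-tuple hashing can be sketched as follows; real switches use vendor-specific hardware hash functions, so this only illustrates the idea that a given flow always lands on the same uplink:

```python
import hashlib

def ecmp_pick(proto, src_ip, dst_ip, src_port, dst_port, n_links=4):
    """Hash the 5-tuple and map it onto one of n equal-cost uplinks.
    Deterministic: the same flow always picks the same link."""
    key = f"{proto}|{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links
```

Determinism matters: packets of one TCP flow all take the same path, so they arrive in order, while different flows spread across all four links.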

This process repeats until ultimately we get right to the end host. That points to one of the problems we get in a large-scale data centre environment. Somebody says the network is broken and says, look, I gave you a trace route. That's great: you showed me one of the thousands of paths between this host and the other host. That's wonderful, but I still don't know where it's broken.

One of the things we decided to build some time ago is Facebook's version of trace route: FB trace route. To quickly explain how it works: it works like any regular trace route, the only difference being that we send packets from a huge range of source ports; we use hundreds, even thousands. So we send packets with a TTL of 1, and obviously the rack switch replies to all of those. We send TTL 2; they hash both ways, and different source ports return different answers, and so on. By using a very large range of source ports, we can actually discover the entire network topology between two hosts, provided it's all ECMP. This is valuable in figuring out everything that's there.
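
The topology-discovery idea can be sketched like this; `hop_at` here stands in for the real probe/response round trip, so this models only the logic, not the packet I/O:

```python
def discover_topology(hop_at, max_ttl, src_ports):
    """Probe every TTL from many source ports and record each
    distinct interface that answers; with enough ports, every ECMP
    branch at a given hop should eventually show up."""
    topology = {}
    for ttl in range(1, max_ttl + 1):
        topology[ttl] = {hop_at(ttl, port) for port in src_ports}
    return topology
```

A single source port would reveal one path; sweeping hundreds of ports makes all the parallel devices at each hop appear in the per-TTL sets.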

So, NetNORAD. The basic premise of this was: how do we find loss in the network when the network isn't telling us that there is loss, and how do we understand what the customer experience of that loss is?

So, the idea started off very simple: why don't we just ping all the hosts in the network? We started out with a small number of pingers, and responders ideally in every single rack in all of our data centres.

NetNORAD evolved over the several years it was being built. We started out just running ICMP pings from a Python script, then we decided that wasn't going to scale very well. Then we started doing probes in TCP. What happened was it worked a little too well, because all of our service owners inside Facebook monitor TCP resets, and they started getting upset when they started seeing these; they said, it's skewing all our metrics. We said okay, we'll go back to ICMP, we'll write it faster. That was great, but then somebody pointed back to the hashing algorithm we discussed earlier: with ICMP there's no source port, so we can't actually ensure we are covering all of the links in the network. So, okay, finally we settled on UDP.

So we had a very simple probe format. It simply has a signature so we can tell the responses apart. We put time stamps into it and we put different classes in it.
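As a rough illustration of such a probe, here is one possible layout. The field sizes and the signature are assumptions for the sketch; the real wire format is defined by the open-sourced pinger.

```python
import struct
import time

PROBE_FMT = "!8sQB"            # signature, send timestamp (ns), traffic class
SIGNATURE = b"NNPING\x00\x00"  # hypothetical marker to tell responses apart

def make_probe(traffic_class):
    # Pack one probe payload: signature + send timestamp + class.
    return struct.pack(PROBE_FMT, SIGNATURE, time.time_ns(), traffic_class)

def parse_probe(payload):
    # Check the signature and recover timestamp and class, so our
    # responses can be told apart from stray UDP traffic.
    sig, sent_ns, traffic_class = struct.unpack(PROBE_FMT, payload)
    if sig != SIGNATURE:
        raise ValueError("not one of our probes")
    return sent_ns, traffic_class
```

The responder echoes the payload back unchanged, so the sender can match the signature, compute round-trip time from the embedded timestamp, and bucket the result by traffic class.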

This UDP pinger is something we have Open Sourced, so you are free to download and play with it. How do we deal with all of this data, because there is lots of it? In fact, there's a tonne of it.

So, we generate all this data, and what we care about is: is our network okay? The unit of network that we look at matters; a rack is actually quite small for us to take any action on. Usually we are looking at a cluster, which is a bunch of racks, and what we care about is whether that cluster is performing well or not. So, we need to have a bit of a topology lesson. The lovely round thing at the top here is the Facebook backbone. We have multiple data centres. Inside one of these sites we typically have multiple buildings, inside those buildings we have clusters, and inside those clusters we have racks. We also have POPs, as James mentioned earlier, dotted all around the Internet, which obviously also connect to the backbone.

So, pinging inside a cluster. Inside a cluster we want to build a very dense mesh of pings. We want to ping ideally every single host. We have a few pingers in every single cluster, and they are pinging several hosts per rack in every single rack that's in that cluster. That gives us a dense measure of the availability of that cluster.

That on its own doesn't tell us a lot; we also need to know whether they can talk to the rest of our network. In this diagram we have a map, we have Prineville and the other one, and we have a target cluster here. What we do is, from a different cluster in the same data centre, we send traffic to the targets in that cluster. We also send traffic from a different data centre in the same region to that cluster, and we send traffic from a different physical site that's somewhere else in the world to that cluster. So we get three different metrics: one local, one from a different data centre, one from a different site. And this actually gives us a reasonable amount of information. When we see that, from a cluster in the same building, everything looks fine, but from a cluster in a different building things look bad, and from a remote building things also look bad, then in our fabric networks we say this is probably going to be at the spine or edge of the fabric, because it's only impacting traffic going out of the building.
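The localisation reasoning from those three vantage points can be sketched as a simple decision rule. This is just an illustration of the logic described in the talk, not Facebook's actual classifier:

```python
def triage(local_lossy, same_region_lossy, remote_lossy):
    # Three vantage points: same building, a different data centre
    # in the same region, and a remote site.
    if local_lossy:
        return "inside the cluster or its racks"
    if same_region_lossy and remote_lossy:
        return "spine/edge of the fabric: traffic leaving the building"
    if remote_lossy:
        return "backbone or remote site"
    return "healthy"
```

The interesting case is the middle one: local traffic is clean but everything crossing the building boundary suffers, pointing at the spine or edge layer.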

So, NFI. Why did we have to go and build something else? Well, it turned out NetNORAD is great because it tells you these hosts in one cluster can't speak to these hosts in a different cluster. That's useful. But another principle that I'd like to stick to is that alarming should be as accurate as possible, with a clear path. As we started to build these massive fabric networks we had a problem, which was we'd have clients and servers detecting loss to each other, and we'd basically say: hosts connected to the bottom of this fabric see packet loss to the hosts at the top of the fabric, go find the bad link or device in there. As you can see, there's a lot of choices, and that's a lot of manual investigation, and that's just not going to fly. So we decided we had to build something new to figure out how we can find loss and triangulate it inside the middle of these networks. This is what we were looking for: how do we find loss that's stuck right in the middle? How do we find something even as granular as an individual bad link in between these layers? We said, if a network engineer can run traceroute manually, wouldn't it be a good idea if we just ran it all the time? Let's just do tonnes and tonnes of traceroutes, all the time. So we started doing that: we allocated a couple of hosts in every cluster and started doing traceroutes between them. We were able to build our entire topology inside our data centres, and we end up with a map like this: I can enumerate all the interfaces that a packet is going to touch. We do TCP traceroutes, we build a list, and we do this over thousands of ports so we understand where this traffic is going to go.
Then what we do is we send Thrift requests, which are little TCP packets with a payload, very similar to HTTP requests. We send tonnes of these across from the same range of ports. So if we did the traceroutes over 4,000 ports, the first Thrift request goes from port 50,000, the next one from 50,001, and so on until we loop around through the whole range. We send these as fast as we can, ideally a few thousand a minute, if not tens of thousands a minute.

What happens is, if we go back to our hashing diagram here: because we have set the source port and run a traceroute using exactly the same protocol, source address, destination address, destination port and source port, then providing the network topology hasn't changed, I know all the links it's going to touch. Therefore, when I drop a packet, I can figure out, looking at my topology map, all the links that I think it should have touched, and we do the traceroutes aggressively, every three or four minutes, to keep the map up to date. If I drop another request from a different source port, it would have touched a bunch of different links, and now I have that data too. At the end of the minute, we look at all the data we have sampled and say, well, let's overlay it on top of each other and figure out which is the bad link or the bad device on the network. And that sounds easy. At least when I first looked at it, I thought this was going to be easy: increment a counter for each link, look at the data in aggregate, and the bad device will rise out of it.
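That naive overlay might look like the sketch below, where `port_to_links` is a hypothetical traceroute-derived map from source port to the links on that path:

```python
from collections import Counter

def blame_links(port_to_links, lost_ports):
    # Every link on the path of a lost request's source port
    # picks up one count; aggregate over the minute.
    blame = Counter()
    for port in lost_ports:
        for link in port_to_links.get(port, ()):
            blame[link] += 1
    return blame
```

With two paths that share one spine link, losses on both paths concentrate the counts on the shared link, while healthy parallel links stay near zero.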

So I did that, and my graphs looked like this, which told me very, very little: some things looked slightly worse than other things, but we didn't actually get a lot of great signal. This was disappointing.

I thought about it some more, and I realised I was missing something important. The number of links between each layer of the network is really, really important, because the statistical likelihood of hitting a link twice differs depending on the number of links. In the example I have given here, we only lost two packets. There are only four links from the rack switch up to the cluster switch, so you have a one in four chance of hitting a given one. In the spine layer, there are 48 different links, therefore my chance of hitting one of those links twice is way, way lower. So we ultimately came up with a very simple algorithm for the error count of a device or a link in our network: we take the percentage of requests that we lost from source to destination, we take the number of ECMP links at that layer of the network, and we take the percentage of lost requests that that circuit was involved in. The reason for that last part is that we found we are really, really good at finding congestion as well as bad links, because congestion also looks like dropped packets. So to try and defuse that in the data, we look at whether all of your loss is on a single circuit...
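Putting those three quantities together, a plausible reconstruction of the per-link score is the product below. The talk names the inputs but not the precise arithmetic, so the exact combination is my assumption:

```python
def link_score(loss_pct, ecmp_links_at_layer, share_of_lost_on_link):
    # Normalise by the fan-out at the link's layer, and weight by how
    # much of the observed loss this particular circuit was involved in.
    return loss_pct * ecmp_links_at_layer * share_of_lost_on_link
```

For the same end-to-end loss, a spine link (one of 48) seen in every lost request scores far above a rack uplink (one of 4) seen in only a quarter of them, which is exactly the separation the raw counts failed to give.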

So, once we applied that formula to our data, suddenly our data got clearer. Suddenly we have this one spine switch link in the middle standing out really clearly as probably the bad one. And then, well, again, this is all kind of science project stuff; these are pretty graphs. So, one very happy day I came in to Jose, I turned around, I looked at this graph and I said, what's wrong with this device? And he said, I don't know, there are no alarms for it. So I said, can you go look at it please? And he said, but of course. And he started looking through all of the graphs we had, the counters, and he said everything looks fine. He said, let's just drain this device and see what happens. This is what happened. I said, much as I trust my own software, I have a slightly newer version that I am going to push. I pushed that new version and suddenly the loss came back, so we were pretty sure this device was broken. It was behaving very statistically abnormally. We went back to the vendor who provided that device, and they said, oh yes, there's this whole set of like a million other counters you need to look at that we don't expose, you are going to need to pull those too. That's impossible, please do better. But that's unfortunately the situation you get into at this scale: you find all the weird problems, and you find all the weird counters that nobody ever thought of looking at before.

So, that sounds great, but what we built at first was really a prototype. We had lots of issues: issues with coverage and issues with how the software worked. So, we broke the software into more components. We had the server; that's the easy part, because it's basically a Thrift server, something that will answer a ping request with a response. We reflect the payload back in the response, and we set the same type-of-service settings to make sure that the packet coming back is effectively the same as the one the client sent, which is what's interesting. This was all written in Python initially. Then we had the requester, which is basically the thing that sends all the Thrift requests, and we have the topology component, which is the thing that does the traceroutes. The traceroutes are implemented in a library called ZTrace, which is basically a high performance TCP traceroute. And that was great. But we said we need more data: we need to send more Thrift requests, and faster, and there was a lot of variability in that data that we weren't happy with. We decided, well, what if we first rewrite the Thrift server in C++, because it was very simple, and then rewrite the client in C++ as well? The traceroutes happen once every few minutes; the Thrift requests happen basically constantly. Then we looked at the graphs and, it's amazing, but when you write something in C++ it's way faster. This is the latency of requests: the higher percentiles went down from 7 milliseconds to 1. When we did the same on the client we quadrupled the request count, and even still the latency went down again, to sub 1 millisecond.

So that was great. As has been pointed out at least twice now during this conference, traceroute is unidirectional. Yes, we know that. But I kind of accepted that for version 1, which was something we wanted to get out the door, and the signal we get in one direction should be okay. What we did in the initial version was, if one host was sending traffic to the other one, we made sure the other one was sending it back. Half of our data was basically bogus, but half of it was good, and the bad link would still rise above. Even so, we said we want cleaner data; it's always better. So we made a simple change. We said, well, if you can do traceroutes from client to server, what if we just put the ZTrace binary on the server and have an API call that says: start doing traceroutes back to me from this source port? The client tells the server, I'm going to start sending traffic to you from this range of source ports, do traceroutes back to me, and I'll collect them and process all the data together.

Suddenly, we get to see traffic in both directions. That actually made the data much, much cleaner, more than we expected.

So what's next? Running on Wedge. The last time I gave this talk, I didn't have it running on Wedge. Today I do, but it runs really badly, so we have to fix it to make it go faster. The reason for this is that we have Wedge switches in basically all of our racks now, and the problem we have is coverage, because we cannot get servers in every single cluster, as some of them are full. So we want to get it running on Wedge, and we need to make the performance on Wedge better to do that.

The other thing we are working on is our backbone network. All I have talked about is our data centre network; it's the largest number of devices and links. But our backbone network is all MPLS, and unfortunately it changes a lot and obviously traceroute doesn't work there at all. So that's something we need to fix and something we're working on.

And that's it.

Why is all this important? Well, the problem for me is that distributed systems hate packet loss. And the reason is that, as I said, it takes 1 millisecond for us to send a request across from one data centre to another. It takes 200 milliseconds to retransmit that request if one packet goes missing. That's 200x; that's really, really bad. And people use Facebook for important things. James went to some lengths to say that Facebook is important. Yes, it is. People use it for critical communication; they also use it for amazing cat photos. But there are more important things than that. I was hoping that no one would point out the problem with voting systems and elections. I, being a big believer in freedom, and being a big believer in democracy, think that everything should be an election, and that me and my Facebook friends, of which I have many, all of whom are absolutely real, can come here and participate in a friendly election to take over RIPE. Because, and this much I promise you, if you vote for us, there will absolutely be free Martinis for everyone at every RIPE meeting from this point onwards.



CHAIR: I guess first we'll take the technical questions, and then the questions about where the drinks will be, all right? Anyone, questions?

SPEAKER: Hi, my name is Erik, from the Amsterdam Internet Exchange. I have a question about the UDP pinger specifically. What I can see on GitHub at this moment is a whole bunch of C++ source code which I have to go and build; one of my colleagues tried to do it and we failed. So, would it be better to bundle it in a Debian package, so we can get rid of the dependency problem?

RICHARD SHEEHAN: Absolutely. There are people here who are responsible for that; the hairy Australian back there, he is the one you want to talk to. Putting it out in packages would be great and people would use it more. It uses a bunch of Folly libraries and things like that, and you would need all those libraries to get it compiled.

SPEAKER: All right, I will relay the good news then.

SPEAKER: Jen Linkova. Thank you very much, very interesting. Question: do you actually compare your expected topology with the topology you cover with your traceroutes? Because you can't guarantee that you actually cover a hundred percent.

RICHARD SHEEHAN: I didn't cover that, but we have a database with our entire topology and we compare against it. One of the interesting things is that we are mostly a v6 network these days, which means a lot of our hosts only have v6 interfaces; the actual network is v4, but we don't have v4 addresses on the hosts any more. We do do that comparison, and there are two reasons. One is to make sure we have good coverage. The other one is that the traceroute only gives us the IP address of the interface that responded; it doesn't give you both ends of the link, which is annoying, because when you are trying to alarm on this stuff you want to give somebody both devices, as it may be an ingress or egress problem. So when we're doing the alarming, we join all the data together with this topology information, so we can say it's actually this circuit and not just this individual interface. That gives the person dealing with it the right amount of data to decide which device to pull, which is going to be the safer one to remove from service.

SPEAKER: Alex. So, how would you estimate your overhead? Because, this is a whole lot of pinging you have there.

RICHARD SHEEHAN: We do millions of requests per minute, but each request is very small, only about 1K, so everything fits inside a single packet. We do a full TCP handshake for every request, because I believe your test traffic should look like production traffic. As for the overhead, we have got a really, really big data centre network, and no one has complained yet that I am sending a bunch of small packets. We do them at a high rate, but they are very small requests; I have never even noticed them in a graph, to be honest, from each individual host that's doing them. There is literally so much bandwidth in our Clos networks that I would be surprised if anyone ever found this, to be perfectly honest.

SPEAKER: This would work very well for static cases, but what about when you have flows that migrate across the network? Any smart stuff, AI that's looking for patterns, dynamic ‑‑

RICHARD SHEEHAN: We are really looking for the big stuff that's going to be customer‑impacting; we want a view of what our internal customers see. Most of the losses that I have seen in the last year have been down to crazy hardware and software bugs that blackhole packets. There haven't been too many that I have been genuinely surprised by, like somebody making a change, or an automated change, to some routing policy; those are the interesting ones. Pretty much all of the alarms we get are congestion‑based, because we have a lot of Hadoop traffic, which pushes as much network bandwidth as it can. The hardest part is trying to figure out whether an alarm is a unique bad device or congestion; you have to know the difference, because pulling a device out of a congested set of four or 16 makes the situation worse. So you have to know it's an individual bad device, and that's actually been, so far, the hardest part.

SPEAKER: Warren Kumari, Google. So, a while back, Petter, whose last name I can't pronounce, was doing a draft on data plane probing which is supposed to do very similar stuff to this. Do you know if he is still working on that or ‑‑

RICHARD SHEEHAN: He is still working on it. We like this approach because it works everywhere; we don't need any special hardware ‑‑ any device that runs a traceroute works, which is unfortunately not all of our vendors, but that's a different discussion. Basically all we need to make this work is traceroute, and that's the reason why it works everywhere. Obviously we are looking at stuff in commodity ASICs, but again we're trying to derive this from the outside, as opposed to the network device putting information in the packet. There you are assuming the ASIC is doing the right thing, which I never want to trust; I do like to treat the network as a black box for this kind of thing. We are going to look at that, but other than discussion, I don't think we have actually built anything yet.

CHAIR: One more question.

SPEAKER: Last question. Filippe from NETASSIST. As we know, distributed data services suffer from network failures. How do you actually estimate the degradation when a node goes down, and how fast do you fix data service degradation?

RICHARD SHEEHAN: NetNORAD gives us the best idea, cluster to cluster, of what percentage of packet loss we see, and the raw number of requests gives us a good idea of what the actual network impact is. Ultimately, it turns out that the service impact is really hard. Some of our services handle it really, really well and route around problems, and some of them just don't. Usually data services tend to react most badly to this, because they tend to be in one place; web services and things like that are a bit more resilient. I don't know if I answered that question particularly well; I don't think we really have great data on understanding the actual impact. Usually the website is down and people freak out about it pretty quickly. The other question is time to alarm. Realistically, we publish all this data into our system and can alarm on it within probably two or three minutes, but we're still doing manual remediation, because we are still tuning the alarms to differentiate between congestion‑based loss and actual bad device loss. I reckon we can do it in five minutes, and the rest would be tuning our intervals down. As for making the pipeline that we push the data through to aggregate it smaller, I don't think we'll get it much below three‑ish minutes. That's going to be my target for next year.

CHAIR: That's it then. Thank you very much. And for the drinks maybe we can do them later.


BRIAN NISBET: Next up is the NRO NC elections, and Hans Petter is coming up to introduce the process.

HANS PETTER HOLEN: Thank you. So, how many knows what the NRO NC is? I can't see anybody, so I hope you all raised your hands.

So, as of RIPE 73, all attendees here have received an e‑mail today inviting you to register to vote. Usually, we have done this with pieces of paper on Fridays, but this time I have asked the RIPE NCC to do this electronically so that you don't have to be up in the morning session on Friday to vote. So this is purely for your benefit, right?

All those registered to vote will be sent a link to the voting software on Thursday evening and you will use the same trusted third‑party voting system that we use for the GSM ‑‑ for the GM, sorry.

Voting closes at 10:30 local time on Friday morning, so you can vote on Thursday evening or during the night or on Friday morning. And the election results will be announced after the NRO/RIR reports on Friday morning.

So, candidates. So, I was going to allow the candidates two minutes to present themselves. So, Paolo Bellorini, are you present? I have been told you are not. So, instead, I will then read out the statement that he submitted.

Paolo Bellorini is a senior technician and is highly passionate about everything related to information technology. He is a kind of visionary regarding future market developments and he is interested in niche solutions. From the moment our company became an ISP, he has encouraged our membership of the RIPE NCC, particularly for the independent character of the not‑for‑profit organisation. His technical expertise, passion and careful dedication to the activities of which he takes charge are the reason I intend to... represent the interests of all members of the community.

And he has not proposed any biography, but you can find him on LinkedIn if you Google him.

So, second candidate, Engur Pisirici, are you present? No. I will then ‑‑ and I have been told he is not registered for the meeting. The motivation for the nomination is to learn, understand, participate and propagate fundamental of numbers resources Internet revolution. And he has not provided any biographic information either.

Third candidate, Filiz, you are present. Do you want to have your two minutes to present yourself, please?

FILIZ YILMAZ: Hello. I am the third one on the list, Filiz Yilmaz. I don't have any cocktails, but you know... So, I have been involved in the community in different positions so far. The first time I was introduced to the RIPE community was when I started at the RIPE NCC 15 years ago as a staff member. Then I moved to ICANN, again as a staff member, and now I work for Akamai Technologies, and I believe I am an active participant in both fora, so you can put trust in my work. Three years ago, you elected me for this position, so I am re‑running. During those three years, I was elected as the Vice‑Chair of the group, and we have seen through the IANA transition as the highlight. I would like to continue the work, and I hope your trust still holds for me. Thank you.

HANS PETTER HOLEN: So, thank you very much, Filiz. So now you know all the three candidates. And you see the voting procedure, so it's going to be interesting to see this first time we run this electronically. So thank you for the time.

BRIAN NISBET: Thank you very much. So, the last talk in this session is from Craig Thompson on open optical monitoring.

CRAIG THOMPSON: Thank you very much and thank you, Richard, for reminding us all that we desperately need a drink. So I will try and make this as brief and interesting as I possibly can.

I work for a company called Finisar, and if you are not familiar with Finisar, there's a very good chance that the traffic over your network is being transmitted over fibre made by ‑‑ fibre connections made by our optical transceivers. So, I'm going to talk a little bit about optical physical layer monitoring and a lot of the premise of this is that, especially in a large scale network, like Richard talked about earlier, if you overlay additional monitoring information in the physical layer, in addition to all of the protocol level data, it allows you to debug problems much faster, and it allows you to commission networks much more quickly before you put them into production.

Here is a very simplified hardware and software stack for a network element; in this particular case it's an Ethernet switch. And the question here is: how do you expose all the information from the optical physical layer through the software stack? That can be things like the temperature the module is running at, or the receive optical power coming into the module. We have all been used to dealing with the OEMs, who have done an okay job at integrating all this hardware and software and offering some of this functionality, but we were basically at the mercy of their decision to release this information on whatever schedule they could support. It's getting much more interesting now with the open networking movement: there's a lot more choice, a lot more flexibility, and a lot more innovation going on to get more of this type of information into the hands of the end user. But the question still remains: how do you get that optical information from the physical layer up the stack to your monitoring applications or your network technicians? I am going to focus on the open switch area, and we can talk about the OEMs afterwards.

So, the premise here is taking information from an optical transceiver, decoding it in a consistent fashion, and then presenting it to the upper layers in a consistent, well‑defined API. That was the goal of open optical monitoring. This was really kicked off within the Open Compute Project: we are a member of Open Compute, and we were participating in inter‑op testing at the University of New Hampshire last year, where we ran into a very basic problem. In testing bundles of networking equipment, combinations of switch hardware, NOS and connectivity, we were not getting consistent decode of simple serial ID information. This is things like part numbers and manufacturing dates. So we kicked off this effort to standardise the decode of that information and make it available in the Linux stack.

So, what is open optical monitoring? It's basically a library that sits in user space on a Linux stack. It takes the information from the EEPROM on the optical transceiver, decodes it, and presents it as a Python API at the upper level. It provides both read and write functionality to the optical module.

It works basically on any Linux‑based network operating system. We have tested it on a number of ethernet switches from Broadcom and others. It works with basically any optical module that is standards‑compliant; it's not just Finisar. And it does require a Shim, which I'll talk about in a second, that we have also developed and released into the Open Source.

It's Open Source, it's easy to maintain, it's extensible, we encourage you to download it and improve it and put it back up there. And just from a few lines of script, you can extract a lot of information from modules or from cables.
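The OOM library hands you this data already decoded, but to make concrete what "decoding the EEPROM" means, here is a hand-rolled sketch for two SFF-8472 A2h diagnostic fields. The offsets and scalings are from my reading of the spec (temperature at bytes 96-97 as a signed value in 1/256 degC, Rx power at bytes 104-105 in units of 0.1 uW), so treat this as an illustration rather than a reference decoder.

```python
import math
import struct

def decode_dom(a2_page):
    # a2_page: the raw 256-byte A2h (diagnostics) page of a module.
    temp_raw, = struct.unpack_from(">h", a2_page, 96)   # signed, 1/256 degC
    rx_raw, = struct.unpack_from(">H", a2_page, 104)    # unsigned, 0.1 uW
    rx_mw = rx_raw * 0.0001                             # 0.1 uW -> mW
    return {
        "temperature_c": temp_raw / 256.0,
        "rx_power_dbm": 10 * math.log10(rx_mw) if rx_raw else float("-inf"),
    }
```

In practice you would never hand-roll this per key: the whole point of the library is that those 200+ decoded keys come back through one consistent Python API regardless of vendor.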

Here is the basic architecture. If it's running inside the switch itself, it's basically interfacing to the kernel. Depending on what Linux distribution you use inside the NOS, there are a number of different ways you can access this information, and so you have this Shim layer that provides a clean interface to whatever the particular Linux kernel makes available. There is a decode library that then provides the northbound API to a range of software applications. We can also do the generalised case where we are transmitting that over a network connection, say if you are monitoring remotely. You can see some of the different ways that the kernel accesses this information, and why it needs to be a Shim: generally speaking, we work a lot with sysfs, occasionally it's ethtool, but we can decode all of that information and send it over the network. We are demonstrating that in the hallway outside, so come over and have a look.

Why would you want to do this? Why would you want to monitor optical layer information? Just like we did at the inter‑op event in Open Compute, you may simply want to track and verify serial ID information, or do an inventory of all the ports and optical modules in your network. You can use this information to do a certain amount of topology mapping. You may want to simply monitor the health of a physical layer connection: you can use things like transmit and receive power, or module temperature, to create heat maps of health in a large scale network.

A lot of this is provided by a technology called digital optical monitoring which is technology that Finisar has developed and distributed widely throughout the industry.

You might want to do some diagnostics on that information, and isolate any connectivity issues. So you can quickly determine whether this is a physical connectivity issue or it's a protocol issue.

And then there's the most exciting part, as far as I am concerned. All the rest of this is standards‑based information, but we're starting to use this open and extensible interface to provide custom features inside the optical module. These tend to be defined by the vendor itself, so in our particular case we are rolling out new features that are available in our optical modules. Just to give you an example: you could do pattern generation inside the optical module and then detect errors on its partner link, for essentially a built‑in self‑test of that layer of the interconnect, and you can isolate your faulty module within the thousands or tens of thousands of modules that you might have in your data centre network.

We showed this at the OCP summit earlier this year. Just some screen‑shots of the monitors and the cabling and the setup you see there.

So, how can you help the effort here? Again, please download it, use it, improve it. There's the GitHub link, or you can e‑mail me afterwards and we can send it to you. We have decoded over 200 keys so far, which cover most of the data centre interfaces that you use today. We are now starting to work on some of the router module interfaces and eventually telecom. Like I said, there are many Shims available to support the various Linux OSs. We have been testing the full stack on Edgecore switches, and we have one out in the hallway there, but you can also get an eval board from us. We encourage you to come by and share use cases; we have had a number today, and we are excited to hear how this might positively impact your network management and monitoring. This is now an OCP accepted project, which means that it's officially adopted by OCP, and it is being used extensively in the inter‑op testing.

Just quickly before I finish: the inter‑op testing happens on a regular basis at the University of New Hampshire. We test dozens of NOS, cable and module combinations, and those that are approved are added to an approved list. We do plug tests multiple times a year, but testing actually occurs weekly.

So, thank you very much.

BRIAN NISBET: We have time for one very quick question, if there's one. Okay. Thank you.

So, that pretty much concludes this afternoon's session. I am going to remind you to rate the talks, as previously mentioned. Unfortunately, you are unlikely to win a Martini, but I think you can order those on the Internet. Also, the voting page for the RIPE PC elections is now up and working. I'm not going to give you a URL; you can just go to the front page of ripe73.ripe.net, where there are links to the biographies etc. You need a RIPE access account to vote. There is the party this evening, which you should all go to. And there will be more Plenary stuff on Friday, but for the moment we're going to hand you over to the Working Group Chairs and their Working Groups over the next two days. So thank you all very much.