Connect Working Group
26 October 2016 at 12 noon:
CHAIR: People thinking they can follow Address Policy from here, they can't. Should you be interested in address policy ‑‑ so, who is here specifically to avoid Address Policy discussions? Who is here to watch some informative and entertaining presentations about interconnection? At least I still got a majority for that, I am happy, thank you. Somebody running off to catch Address Policy. All right. It is now 11:02 in the UK, which is the normal time for this presentation; it is two past noon here in Madrid, but they are time zone confused. Welcome to the Connect Working Group from Madrid with love. It's the second most favourite Working Group of this meeting, the most favourite of course being NCC Services later today. We have a jam-packed agenda for you today, and we might even have some spare stuff in case we run through too quickly.
So, here is the agenda. Ameal has been volunteered by his boss to scribe this session, so thank you, Ameal. Any comments on the agenda? No. I figured as much. You have all read the minutes of the previous session, haven't you? Show of hands. Well done. So I take it we are all in favour and approve these minutes? Going once, going twice. So thank you to the RIPE NCC for carefully drawing those up.
For the rest of the agenda we have got a couple of presentations from the IXP tools hackathon that happened this weekend, a presentation about illegitimate source addresses at IXPs, and then Benedikt is going to talk about scaling BIRD route servers and Leslie will give a summary of the panel discussion on the challenges for small IXPs, and then we will have some feedback about what you thought about the session and we will have some lightning talks and feedback some more; that was me rushing through making slides. And we are going to close the session in time for lunch, I hope.
So, without further ado, we are going to move to agenda item number 2, and we are going to start it with Pinder, which ‑‑ I was part of the jury for evaluating the projects during the hackathon, and a mashup between peeringDB and Tinder was simply too good to pass up. Here is Matt.
Matt: So this is an idea we had at the hackathon, which was essentially trying to make communicating with peers a lot easier, so we considered it peer speed dating to a certain extent. Many of you will go to events like EPF and GPF and meet high value peers and discuss with them. This isn't about that. This is about the e‑mails you tend to get, which are essentially from smaller networks ‑‑ and the reason why you generally say no is because it's 30 minutes to 60 minutes worth of effort. They will normally send you an e‑mail, possibly from a gmail.com address rather than a corporate one, and say something like "peer with me" and forget all their details, so you will look them up on peeringDB, whatever. I think this isn't good enough. I think there should be something that allows peering to be facilitated more easily, and the reason why you reject peering should be that it's not a good fit, as opposed to it's too much effort.
So, okay, how do we fix this? Well, we have decided that if your data is already in peeringDB, why don't we just integrate with it so introducing Pinder, swipe right on a new peering relationship. Okay. It is Tinder for peering. I apologise. So, we have got a brief diagram as to how Pinder works. Essentially you are going to submit a peering request to somebody and three actions happen: Reject it, accept it or contact me, I need to talk to you about this. Maybe you don't want to peer at a particular IXP and, you can roughly get from that what we are trying to do.
So, when you say contact, one of the important things is that you are not saying no. Too many times people expect you to say yes straight away; well, maybe they haven't got a valid AS macro, so to improve the health of the Internet you want them to register these things in peeringDB, and that will give you a way of contacting them rather than just leaving this request stale for a long time.
So, let's have a quick look at the app that we have built. We figured all good apps come with shiny web 2 ‑‑ this is web 1.99. You input source and destination ASN; if it was implemented for real you wouldn't have the source ASN, it's yours. You type in a couple of numbers and it shows you all the exchanges that you have in common in peeringDB, so we have just mocked up that these are all the exchanges you could possibly peer at, and your request has been sent. As simple as that. If you are only a small peer you don't need more than that. You will then see that your request is waiting on the other side and you can press the magic go button and it will say these are the details of what wants to happen, ACME is peering with EvilCorp, and you can press the heart icon and it swipes right. That has been sent off. It's in progress, you both go and configure, everything is done. But we are network engineers, some of us like the whole command line thing when we want to use Perl and whatever, so if this is a simple idea, it can be done in a few minutes. Why not put a CLI on it? Cool. You can see all the status of your requests, what is finished and what is pending, all that. Sure. It's a CLI. Okay, well, that doesn't really give us a full view of what happened at the hackathon because this is quite easy so far. So what about some Ansible integration? Why can't we automatically go and do these things? We can do that. Given this simple diagram ‑‑ your network may be more complicated than this ‑‑ you have got two peers at an IXP. We can see here from get requests there is an outstanding request, and we can see here that the Bravo network wants to peer with Alpha, and we say okay, I approve that. Well, once we have approved it, why do I have to log into my router and configure it? So we just say deploy it.
It goes off, it's deployed, it puts in the right syntax there, and the very last stage here, called confirm peering, will actually talk to the Pinder API and say I am ready, I am configured. So, if the other side goes and does similar things you will see they have done it as well. Both are now finished, so you will see in your get requests that both sides now say ready, they are both configured, the peering should hopefully be up. If something is wrong, like an ACL, you can get their contact details and contact them individually rather than having to do anything crazy, and yeah, everything is now marked as finished, you have said complete requests, it's done.
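The request lifecycle described above (three possible responses, then each side confirming its configuration) can be sketched as a small state machine. This is only an illustration, not the Pinder code: the class, the status names and the `respond` and `confirm_peering` methods are all hypothetical, and the AS numbers come from the documentation range.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"    # request sent, waiting for the other side
    CONTACT = "contact"    # not a "no": the peer wants to talk first
    REJECTED = "rejected"
    ACCEPTED = "accepted"  # approved, routers not yet configured
    FINISHED = "finished"  # both sides confirmed their configuration


@dataclass
class PeeringRequest:
    source_asn: int
    dest_asn: int
    ixp: str
    status: Status = Status.PENDING
    ready: set = field(default_factory=set)  # ASNs that confirmed config

    def respond(self, action: str) -> None:
        """Apply one of the three actions open to the receiving peer."""
        if self.status not in (Status.PENDING, Status.CONTACT):
            raise ValueError(f"cannot respond while {self.status.value}")
        actions = {
            "accept": Status.ACCEPTED,
            "reject": Status.REJECTED,
            "contact": Status.CONTACT,
        }
        if action not in actions:
            raise ValueError(f"unknown action: {action}")
        self.status = actions[action]

    def confirm_peering(self, asn: int) -> None:
        """Record that one side has deployed its configuration."""
        if self.status is not Status.ACCEPTED:
            raise ValueError("request has not been accepted")
        if asn not in (self.source_asn, self.dest_asn):
            raise ValueError("ASN is not part of this request")
        self.ready.add(asn)
        # The request is finished only once both sides report ready.
        if self.ready == {self.source_asn, self.dest_asn}:
            self.status = Status.FINISHED
```

A "contact" response leaves the request open, so it can still be accepted later, which matches the point that contact is not a refusal.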
So, where do we go from here? We have got all the code of our little example up on GitHub and you can see it there. There are possibly a few concerns that need to be taken into account for this in terms of data security and whatever, but hopefully either peeringDB or somebody similar will be able to take this idea and run with it. Once again, it is not a tool that is going to be used between two big networks; they will want to have commercial terms or whatever, so they may not be interested in this. However, certainly for the Googles and Facebooks and Microsofts who have their crazy tools you have to enter all the information into, that is a lot of work for certain people, and if they can integrate this directly with peeringDB you will massively decrease the amount of communication that needs to happen; it can all be standardised and everything can just be rolled out. Because it is web 2.0 we have got the whole Docker infrastructure in there, there is a container file and you can do whatever you like with it. There is a bunch of infrastructure in there, if you want to use the Ansible template and play around with it, and hopefully when somebody brings it live people will use it. That is what I'm hoping, at least. So, from that, we have everything pretty much written up on our little Pinder website; yes, that is a real domain, thank you very much. And that's about it from the Pinder hackathon project. Any questions?
REMCO VAN MOOK: Any questions for Matthew? Stunned silence is the accurate description. Matthew, I think that was excellent, thank you very much for doing this, and ‑‑
Matthew: We had a fantastic team, if you come along to the next hackathon you will have a great time, honestly.
REMCO VAN MOOK: We move to the second presentation from the hackathon, which is interestingly called the remote peering Jedi, a portal into the remote peering ecosystem. I am not sure if it includes time travel.
SPEAKER: Not yet. Thank you very much. I present our project on remote peering ‑‑ so the motivation is that both researchers and operators don't yet know who is peering remotely at IXPs, and knowing that can inform their peering decisions. To do it we tried to replicate a paper that was presented a couple of years ago, and we tried to automate it over RIPE Atlas paths. So we take the already collected paths that cross an IXP, by getting the prefixes of the IXPs from peeringDB, and calculate the RTT between the ‑‑ of the IXP. That is a bit noisy, but since Atlas has so many paths collected we get a large number of samples that allow us to filter out noise with ‑‑ half percentile and take the median; if it is over 20 milliseconds we say that it's definitely remote. Here are the results. We can see that about 60 to 75% of the ‑‑ are less than 10 milliseconds. We just checked five large exchanges for the hackathon but we plan to extend it. And it's interesting, for instance, to observe Los Angeles, which has that bump much later than the other ‑‑ the LINX. The remote peerings are about 12 to 18%, and then we have a grey area of between 10 and 20 milliseconds, which I can talk about at the end of the talk. To validate our results we collected data from the IXPs and checked whether they agree with our own observations, and in almost all the cases they agreed, so we can be fairly confident that our results in terms of RTT are correct. And we checked the peerings that I ‑‑ I used also what is large CDNs and was a bit surprised by this result because apparently they are used to access ‑‑ with Cloud and Twitter ‑‑ there is a result: although some remote peers are very well interconnected in their local ecosystem, the CDNs are not present, so they have to go all the way to Amsterdam or Frankfurt, so maybe this is a point for the CDNs to be more supportive of the smaller IXPs. Some of the remote peers are very aggressive, the usual suspects.
We have some of them peering in all the IXPs that we checked, and that may have some implications for the actual multi‑homing resilience, which we would not observe if we checked just layer 3 presence.
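The RTT thresholds mentioned in the talk (under 10 ms local, over 20 ms definitely remote, a grey area in between) can be sketched as below. The noise filter is an assumption: the transcript does not preserve which percentile was used, so this sketch simply drops the slowest 5% of samples before taking the median. The function name and parameters are hypothetical.

```python
from statistics import median

LOCAL_MAX_MS = 10.0   # below this the peer is treated as local
REMOTE_MIN_MS = 20.0  # above this the peer is treated as remote


def classify_peer(rtts_ms, drop_slowest=0.05):
    """Classify one IXP peer from RTT samples in milliseconds.

    drop_slowest is an assumed noise filter: the slowest fraction of
    samples is discarded before the median is taken.
    """
    if not rtts_ms:
        raise ValueError("need at least one RTT sample")
    samples = sorted(rtts_ms)
    keep = max(1, len(samples) - int(len(samples) * drop_slowest))
    med = median(samples[:keep])
    if med > REMOTE_MIN_MS:
        return "remote"
    if med < LOCAL_MAX_MS:
        return "local"
    return "grey"
```

Using the median rather than the mean keeps a handful of congested measurements from pushing a local peer over the threshold.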
And also this work gives us some opportunities to interpret better the data that are recorded in peeringDB by the ASes, because many of these remote peers claim to have local presence at the IXPs where it turned out they are remote. Between 16% and 25% actually record in peeringDB that they are located in the local facility, which is down to, I don't know, fat fingers, or because they want to appear more appealing, or because they have a different interpretation of the field and they may not mean exactly physical ‑‑ private interconnections there.
So now the problem is not just to detect remote peers but also to find their location, and we started with some already available techniques, DNS based geolocation, which are both ‑‑ and open IP, which are simple. But the problem is that the coverage is a subset of the IPs that we had in our dataset, and as a fallback we didn't want to go directly to geolocation databases, which are wildly inaccurate, so we sort of created a new approach which we call the presence-informed peeringDB RTT geolocation. And the idea here is to make the problem space smaller by taking the locations where the owner of the IP that we want to geolocate is present and trying to ping the IP from the Atlas probes in the same locations, and then just get the city with the lowest ‑‑ if these delays are lower than 10 milliseconds. And this can be implemented exactly because Atlas has so many probes that the platform covers all the major cities. So here are the results for a remote peer that we observed in ‑‑ if you cross‑check the ‑‑ in Riga in Latvia. In total they had ten cities in peeringDB, at five pings from five probes in each city, and I think the results are telling. We have a demo where we show where ‑‑ is the most peering in some major IXPs. These are real data from RIPE Atlas paths from last month. So we have points on the map, someone can easily see the AS, the IP and the location, and we also have the ‑‑ median in case someone may be interested ‑‑ and based on that. And if we change the ‑‑ you can see some of the pings remain on the map between the different IXPs, which indicates what I talked about earlier, the usual suspects who peer everywhere remotely. And this is ‑‑ this is the distance from the main connect providers. And here you can see for Any2 how the remote peering on the side of the Pacific Ocean is so much higher for these guys.
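The presence-informed geolocation idea above (ping the target only from cities where its owner is listed, then accept the closest city if the delay is under 10 ms) can be sketched as follows. This is not the hackathon code: the function name and input shape are hypothetical, and taking the minimum RTT per city is an assumption, since the transcript does not say which statistic was used.

```python
def geolocate(rtts_by_city, max_rtt_ms=10.0):
    """Pick the most plausible city for an IP address.

    rtts_by_city maps each candidate city (a city where the IP's owner
    is listed) to RTT samples in ms measured from probes in that city.
    Returns None when no city is close enough to be trusted.
    """
    best_city, best_rtt = None, None
    for city, rtts in rtts_by_city.items():
        if not rtts:
            continue  # no probe in this city answered
        lowest = min(rtts)
        if best_rtt is None or lowest < best_rtt:
            best_city, best_rtt = city, lowest
    if best_rtt is not None and best_rtt < max_rtt_ms:
        return best_city
    return None
```

Restricting the candidates to the owner's listed cities is what keeps the number of measurements small; returning `None` instead of the least-bad city avoids reporting a location that the delays do not support.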
So this is public. We plan to extend it with more IXPs and more data and also make the old data available, not just with the map, in case someone wants to download them and cross them. We are already working on extending the system, so we will try to keep you updated via a mailing list, and yeah, thank you very much. Also just a point about the hackathon: it was a great experience, I really enjoyed it as well, so I am planning to join next year if it's possible.
REMCO VAN MOOK: Thank you.
(Applause)
Any questions or comments? Stunned silence again. I think that is a compliment. I thought it was very good, thank you very much. I hope to see you ‑‑ by the way that link is live and you can play around with it, it's pretty cool. Up next is Franziska, I think. She looks scared. She is going to talk about old stuff at IXPs.
FRANZISKA LICHTBLAU: Hi. I work as a PhD student and I will today talk about illegitimate source IP addresses at Internet Exchanges. This is joint work with my colleagues and that is basically what I have been working on in the last couple of months. So, let's get started. If we talk about illegitimate source IP addresses this could mean a lot. For us, we look at the packet level; we say basically we have source addresses that are not valid within the scope of the public Internet, which in essence, for us, means we look at traffic that is probably intentionally spoofed, we see lots of internal traffic that is leaked by mistake, and, this being the Internet, there is lots of stuff where we don't know why it is actually there. So, why is it interesting? I think we can keep that pretty short. I said spoofed traffic, we all know DDoS amplification; we as researchers can go and study that, and we might be able to help you guys come up with some mitigation strategies to work around and solve these kinds of problems. And maybe it's also interesting to some people to see what we can see about their infrastructure if we look at one of our vantage points. And we will talk about that later. Traffic that does not need to be there utilises bandwidth, and bandwidth doesn't come for free.
So, if we say illegitimate traffic we have to say okay, we need to put stuff in categories, and we came up with basically three categories to look at this traffic to try to understand what is happening there. The first we named Bogon, which basically includes traffic from the RFC 1918, IANA reserved and multicast ranges and things like that. We have the class unrouted, which is traffic with source IP addresses for which we see no announcement in the global routing table; I will come to that later. And the most interesting class for you is the class invalid, which is traffic that is not BCP 38 compliant: traffic that is sent by a network for which we can find no evidence that it is responsible for the corresponding prefix.
Just to do the quick scientific wrap‑up, what do we do and what do we not do? We know a lot of really good projects that are already looking into BCP 38 compliance; everybody here knows the Spoofer project run by CAIDA, where you can send probes to networks and see if they are BCP 38 compliant or not. Our work is different because we provide a passive approach to measure how much traffic that is not compliant with these kinds of policies is there, and we can also try to provide some insights into what this traffic looks like, what its characteristics are and how much of it is actually there. So, now we actually start on identifying the categories I just introduced you to. The first two are pretty straightforward: we have the class Bogon, which basically consists of a static list of prefixes. We can look them up in the RFCs and the provided lists, so when we see a packet that has a source address that is part of this prefix list we give it the tiny label Bogon and put it in that category. For unrouted it's complicated: we look into a vast amount of publicly available data, we construct a list of prefixes we see announced, and if we see a packet that has a source IP that does not match this list of prefixes we name it unrouted.
So, I said a lot about routing information. In order to get the best possible view on what is actually happening, we utilise all the routing information we can get our hands on, so the RIPE RIS collectors, Route Views and the martian prefix list as provided by Team Cymru.
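The first two classes can be sketched with Python's standard `ipaddress` module. The bogon list here is deliberately abbreviated, and `routed_prefixes` stands in for the list the authors build from public routing data; the function names and the example prefixes are illustrative assumptions, not the authors' tooling.

```python
import ipaddress

# Abbreviated static bogon list; a real one would follow the RFCs and
# published bogon references in full.
BOGON_PREFIXES = [ipaddress.ip_network(p) for p in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918
    "224.0.0.0/4",  # multicast
    "240.0.0.0/4",  # reserved
)]


def in_any(addr, prefixes):
    """True if addr falls inside at least one of the given networks."""
    return any(addr in net for net in prefixes)


def classify_source(src_ip, routed_prefixes):
    """Label a source address as 'bogon', 'unrouted' or 'routed'."""
    addr = ipaddress.ip_address(src_ip)
    if in_any(addr, BOGON_PREFIXES):
        return "bogon"
    if not in_any(addr, routed_prefixes):
        return "unrouted"
    return "routed"
```

A linear scan is enough for a sketch; matching sampled flows against a full routing table would need a longest-prefix-match structure such as a radix tree.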
So, what do these classes I just introduced actually look like? It's pretty straightforward: the big red bar is routed; we know that roughly 70% of the whole IPv4 space is actually routed. We also know that the Bogon prefixes cover roughly 14%, and that leaves us with the blue box, which is unrouted, and we actually find roughly 3 million /24s for that. So, let's get to the third class, which, as I already said, is traffic that is not BCP 38 compliant; how do we actually try to identify that? Here we have a typical set‑up: we have ASes that are connected to each other via the public Internet, and we make one assumption that is important: we say if an AS is announcing a prefix, we make the deduction that it's also a valid source for traffic of that prefix. That being said, the first case, which is described here by the green arrow, is pretty straightforward: we have AS A announcing a couple of prefixes and we go and construct for this AS a list of prefixes we see announcements for. If the traffic that we see matches it, it's pretty good and gets a green arrow, that's fine. Then we start to cover more complex relationships like upstreams: AS B, C and D are announcing their prefixes, so we extend the list for AS A to cover as many of the prefixes it is responsible for as we can. And we do that for every AS we see in the public routing data.
So, what is actually invalid? I talked about what we consider valid. We look at the traffic on a packet level, so if we look at the source IP address and find traffic at our vantage point that is being sourced by AS A where the source IP does not match any of the valid prefix lists we constructed, we tag this traffic as being invalid.
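The invalid check can be sketched as a lookup against per-AS lists of valid prefixes. How those lists are extended beyond an AS's own announcements is only outlined in the talk; this sketch assumes one simple rule, that every AS appearing on an observed AS path for a prefix counts as a valid source for it. The function names and the input shape are hypothetical, and the check is meant for sources that are neither bogon nor unrouted.

```python
import ipaddress
from collections import defaultdict


def build_valid_sources(routes):
    """Map each ASN to the set of prefixes it may legitimately source.

    routes is an iterable of (prefix, as_path) pairs taken from public
    routing data, with as_path a list of ASNs. Assumption: any AS seen
    on a path for a prefix is treated as a valid source for it.
    """
    valid = defaultdict(set)
    for prefix, as_path in routes:
        net = ipaddress.ip_network(prefix)
        for asn in as_path:
            valid[asn].add(net)
    return valid


def is_invalid(src_ip, member_asn, valid):
    """True if the member AS has no valid prefix covering the source."""
    addr = ipaddress.ip_address(src_ip)
    return not any(addr in net for net in valid.get(member_asn, ()))
```

Treating every AS on a path as valid errs on the side of accepting traffic, which trades missed cases for fewer wrongly flagged members.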
So we are researchers, so everything we do has limitations, and since we are in the business of measuring and categorising, we have two kinds of limitations. The most important one for us is that we need to minimise false positives, because we don't want to falsely identify somebody as emitting illegitimate traffic if indeed they are not. We know that we cannot get a full picture of the full BGP state and we have no chance to see anything that is not public, so we don't know about private interconnects, but we try to minimise that. We also do not capture all the illegitimate traffic, so we get false negatives, because we do a vast over‑estimation and try to be very conservative in our metrics, so an AS only needs to be somewhere on the path, not necessarily the stub AS, to be a valid source.
And keep in mind, this approach relies on the fact that it's totally off‑line process, we pre crunch a lot of data so a lot of number crunching and memory is involved here.
So, now we actually start with the IXP fun stuff. Now we have our method, and we applied it to flow data that was provided by one of our cooperation partners. We worked on five weeks of uninterrupted IPFIX data from the beginning of 2016; it was obviously sampled traffic, but we are still reliably assured we can get some ideas of what is happening. We only look at IPv4. How does that stuff look? First I give you a high level overview of our three classes. If you have a look, the first column shows the classes, then we see how much traffic we actually found for each of the classes in one week, and then it shows that in terms of fractions of total traffic. If you look at the numbers they don't look that high; the highest fraction is for the class invalid, where we see that roughly .07% of all bytes at this vantage point were actually of that class, but if we do the math and add all these numbers we see 600 terabytes of traffic within our study period, all three classes in one week, which, if we apply the usual factors we know from amplification attacks, amounts to a good amount of traffic. So let's have a look at how this traffic looks on a timescale. This is a time series for TCP and UDP for one week; every data point corresponds to one hour. On the top row you see nicely behaving traffic, that is the overall traffic we see at that IXP at the time; we have a nice time of day effect, that is what normal traffic looks like. What we were a bit surprised about is the orange line, that is traffic with source addresses from the RFC 1918 range, which pretty much behaves in the same manner as the regular traffic does, which leads us to the conclusion that it's probably some misconfigured NAT boxes that actually leak some of the customer traffic. The rest of the stuff, the detail is not that important, but it behaves very spiky, it's not as constant, so that is usually a good indicator for irregular behaviour.
So, now we need to get a bit deeper. We had an idea how this stuff develops over time, and if I speak about spoofed traffic everybody wants to know about amplification, so we need to look into UDP destination traffic. What I show you here is the regular UDP port mix. I show you the first 10% in this graph because UDP is very fractured, so we can't say 60% of the traffic can be attributed to one protocol; it's very, very highly mixed, so you see the usual stuff: Steam, QUIC, BitTorrent, video streaming, that is how the regular traffic looks. And compared to that, our class of invalid, so the potentially spoofed traffic, shows a bit of a different picture, and I have to highlight here that on the first bar graph I show you only the first 10% and the scale on the other one is actually for 30%, so we see that the invalid UDP traffic is actually dominated by DNS and NTP traffic, which are among our usual amplification culprits, so that is what we see there. So we had an idea of what is in this traffic, and now we are interested in who is sending it.
What I show you here is every coloured bar is the fraction of traffic of class invalid that one IXP member contributes. And what we see, that we can actually attribute more than 80% of this traffic to only three members. So, maybe if we go and educate some of the people we would be able to do something. But let's go a bit deeper into the peer member categorisation. Now we need to explain to you a bit what I'm doing here:
We have a scatter plot here, so every dot and cross corresponds to one IXP member, and we categorise them. We categorise them based on peeringDB and a lot of manual verification for over 700 members, so you will see the tiny red crosses are hosters, the green stuff are ISPs and the purple triangles are content providers, for example. What we did, we actually sorted them by traffic volume, so if you are on the right of the plot you are a high traffic carrying member, and if you are more on the left side you don't have that much traffic. And what we did on the other axis, we sorted them by fraction of unwanted, or illegitimate, traffic in terms of total, so if we see someone in the lower right corner it's a high traffic carrying member with almost no illegitimate traffic. If you are in the lower left corner you are a very low traffic carrying member but don't have that much illegitimate traffic. If you are on the other side, on the top left side, you are a low traffic member and you have a good amount of this kind of traffic.
So, let's try to get an idea of what we are actually seeing here. This is for the Bogon traffic, so RFC 1918 and consorts. You will see we have the axis less than zero and zero, and we see in this class the majority actually does not leak anything, which is good. And if we look into that traffic we see that it is mostly leaked SYNs, so we probably have misconfigured NAT deployments, which also concurs with the finding that these are mainly low traffic ISPs and small hosters.
In the next slide I will show you the same plot but for the classes that contain potentially spoofed traffic. Here the picture looks a bit different. You have to note again, we have more members involved in this class than in the Bogon class, and we actually have nobody in the top right corner, so we don't have high traffic carrying members with a high fraction of this traffic. We still have some clean members, and what we see is that we have clean members with high amounts of traffic, and the biggest cluster you see is in the left‑hand corner, so we have lots of low traffic ISPs and hosters that have a good amount of their traffic in these classes. That leads us to some conclusions. We see this filtering is not deployed everywhere, but it can be done right; large networks can do it right, which was especially good to know for me as a researcher, because when I go into that community people say, yeah, you will never deploy that correctly, it will never work, because for large networks it's almost impossible to do because they have got so many prefixes. We can show that you actually can do it correctly. Lots of small networks lack proper filtering, and only a small number of members actually contributes a good fraction to this class, and that leads me to the conclusion that the effort this community is already making to educate people on how to configure their networks correctly, that it's important to do that and that they are actually able to do that, is a really worthwhile effort. And I would be really happy, I talked to a lot of people who came up with some ideas, if you could come up with some good ideas that we can actually implement, and maybe where we, with our measurement and research focused perspective, can help you guys to come up with some strategies to make the Internet a better place. Thank you.
(Applause)
FLORENCE LAVROFF: Thanks for this presentation. Anybody has a question? Awesome.
SPEAKER: Andre, Internet Society. Thank you very much for this interesting research. My question is: have you compared the results you saw with the Spoofer statistics? In particular, what is the percentage ‑‑ I think you surveyed like 700 members, 700 networks ‑‑ what is the percentage of those networks that are actually emitting spoofed traffic?
FRANZISKA LICHTBLAU: We did not do the full comparison, to be honest. I used the results of some of the other spoofer projects to verify I am not misclassifying a good amount of members, because I don't want to attribute false things to people, but we were not yet at the point where we would do that. But that is actually one of the steps we want to take to get a bit more confidence and maybe to merge our results, so thank you for the suggestion.
SPEAKER: That would be very interesting, because Spoofer says just 70% of ASes, or roughly 70%, are not spoofable, and some people argue that those results can be skewed. It would be interesting to compare. We are biased by vantage point.
SPEAKER: Regarding false positives, how did you tackle the situations when you have a multi‑homed customer that emits traffic through a provider that doesn't announce this address space?
FRANZISKA LICHTBLAU: That is what I said. Let's say we got some bigger cases where we were able to manually filter them out, and if we had a deeper look at the routing relationships, sometimes also at some of the peering relationships we got word of, then we could attribute that, but we can only use the information we can see publicly.
SPEAKER: Thank you very much.
FLORENCE LAVROFF: We have time for some more questions? Anybody else? Well, I think that is it. Thank you.
(Applause)
And let's jump to next presentation, which is from Nick about Flow Telemetry.
NICK HILLIARD: Hello everybody. I am from INEX, and I want to talk to you about Flow Telemetry, specifically DDoSes and what IXP participants can do.
So, we have a problem at the moment with IXPs because there is an awful lot of DDoS traffic hitting the Internet, an awful lot of it is spoofed, and we are seeing an awful lot of DDoS flows hitting IXPs, and it's very difficult to trace them back to the sources because if the traffic is coming in from spoofed IPv4 addresses, the ISPs have no real visibility and no way of detecting ‑‑ well, subject to what Franziska was talking about previously, it's very difficult to detect what constitutes DDoS traffic. No high end routers ‑‑ that is not quite true ‑‑ some high end routers support sFlow data export from the router, but for those routers which only support NetFlow, the NetFlow information will give you the higher level information about the packets coming through, but you won't get the MAC addresses. The router vendors don't seem to want to implement this and this is a real pain.
And there is a problem, that even if the IXP infrastructure collects sFlow information, it doesn't export it. And there is a reason it doesn't export it, and that is that there is a privacy concern, and that boils down to the fact that an sFlow packet is a container format, so you have got multiple sFlow records in a single datagram which gets sent from the switch to an sFlow collector, and there is no software out there at the moment which will actually split up the individual sFlow datagram into sFlow records and then filter on those records and then send the records out. And obviously, you can't just send all of your sFlow traffic, or at least all of your sFlow telemetry data, from the IXP central sFlow collector system to an IXP participant, because let's face it, Microsoft is not going to want to have Google seeing what they are doing, and ACME and Limelight would feel pretty uncomfortable about each other seeing what sort of traffic is flowing over, so that doesn't work as a proposition.
So this is ‑‑ this is a visualisation of the problem: we have an sFlow datagram here, you have got a header and an agent IP address and then a whole pile of sFlow records in there. And sFlow is an ingress, mostly an ingress, technology, so you have to look at the source MAC address when the packet is coming in off the port.
In order to solve this problem, what has to be done is that the sFlow datagram has to be demuxed and then remuxed into a set of new sFlow packets, and then they have to be filtered and sent out to the destination sFlow collector that you want to send them to.
So we had a look around and couldn't find any software that did it. sflowtool, which is the reference implementation from InMon, was very good for dissecting datagrams but wouldn't reassemble them, and we had a look at a couple of other bits of software as well, but pmacct looked like a really good candidate and we tried to see if it was going to work, but unfortunately it didn't, so we sat down with Paolo and had a quick chat with him and he said that seems like a good idea, and he has built this into the framework. So pmacct now fully supports sFlow demuxing and limited remuxing, but most importantly it supports filtering.
Super simple to configure: this is the main configuration file that you would need to implement. There are only a couple of interesting things there that really are important. You have a pre‑tag map, which takes the sFlow records and assigns each sFlow record a map ID, and then the receivers list maps that ID to a destination collector. You can see that a bit more clearly here. So this is a slightly modified live example that we were playing around with to pilot it, so if the sFlow system sees anything with that particular MAC address, as source or destination MAC, it will assign a tag of 32 and then, in the ‑‑ sorry, there is a slight typo in the presentation there; the next file should be receivers.lst ‑‑ and that associates tag 32 with that destination address.
We have built support for this into IXP Manager; it's not rolled into the master release but into the IX live branch. Completely easy to configure; you just click plus and then you put in the destination IP address, the destination port, you click on "add" and then it just appears. This is the AS112 service running at INEX, and a screen shot we did earlier. That is very useful. In order to hook it all together, you also need the IXP to export a map of all of the MAC addresses used at the exchange. IXP Manager does this already and it's already standardised in the Euro‑IX format export schema. IXP members need to run their sFlow collectors. But that is actually okay because most NetFlow collectors will also support sFlow.
Here is kind of a rough idea of how it works: sFlow packets hit the sFlow collector, which demuxes and remuxes all of the information on a per‑MAC‑address basis out to each of the IXP participants, who can then poll IXP Manager or whatever the provisioning system is to pull the mapping between the MAC address and who owns that MAC address at the IXP.
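On the participant side, the last step described here amounts to joining the received samples against the exported MAC map. The following Python sketch shows one way that could look. The map contents, the sample field names and the AS numbers are hypothetical, and the real Euro‑IX export schema is richer than this flat dictionary.

```python
from collections import Counter

# Hypothetical MAC-to-member map, as an IXP might export it (invented values).
MAC_OWNERS = {
    "00:11:22:33:44:55": "AS64500",
    "66:77:88:99:aa:bb": "AS64501",
}


def bytes_by_source_member(samples, mac_owners, sampling_rate):
    """Estimate bytes received per source member from sampled frames."""
    totals = Counter()
    for sample in samples:
        owner = mac_owners.get(sample["src_mac"].lower(), "unknown")
        # Each sampled frame stands for roughly `sampling_rate` frames.
        totals[owner] += sample["frame_len"] * sampling_rate
    return totals
```

Sorting the resulting totals gives a quick view of which exchange member a given burst of traffic is arriving from, which is the kind of question the spoofed-traffic use case needs answered.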
So, the result is, IXP participants can now see exactly what is happening to their IP traffic flows at the IXPs. There is no privacy issues, this is just an extension of what you might see otherwise with NetFlow. If you are having problems with spoofed traffic across the exchange you can now actually see where that traffic is coming from. It's fully live, the data is going to be no more than a couple of milliseconds old, so that means that it's suitable as an input into your traffic management systems. It's currently at pilot phase at INEX, which is to say that it's live, it works, it's not the final configuration that we are going to roll out but it does actually work. And the source code is available on GitHub. Thank you very much.
Any questions?
(Applause)
FLORENCE LAVROFF: No questions? Ah.
SPEAKER: Hi. Eric from AMS‑IX. How scalable do you think the decode would be? I mean, we are using pmacct, I have tried my hand at decoding the sFlow data coming out of the AMS‑IX platform, and what I have seen is, well, I really need quite a lot of instances to cope with all the incoming sFlow, so what is your take on this?
NICK HILLIARD: Scalability. Paolo and I did a small amount of work and he is actually standing at the back and can give a better answer on this, we hit it with the full INEX sFlow feed, which we understand even though the traffic volumes are hugely different, is about one‑tenth the number of packets per second as the AMS‑IX, what do you call it, sFlow feed is. And we assign that a mapping list of 1,000 entries. It was busy, I think it pulled about between 8 and 10% of the CPU and the CPU was a virtual CPU running on a reasonably modern system, so we think it will, it should scale reasonably well, it should be able to handle large IXPs but the advantage of it is that you can actually take a full copy, if you have a single sFlow collector you can take that and take a copy of that data and send it off to another machine and do the remux in there, or alternatively, if you have got multiple feeds coming into your sFlow collection system, you should be able to split it off and you can actually distribute the load among multiple servers so there is ways of getting around the scalability issues.
SPEAKER: Thank you.
SPEAKER: Philip, Net Assist. I would just like to say that your idea of getting the thing done is very good, and the tools used completely fit the problem. I would like to note that, yeah, sFlow is the solution that you chose and it's ‑‑ it's the only solution that works with such an amount of traffic. You don't need exact information about flows. What we see in an attack is a distribution of traffic which actually shows you the attackers easily. I should also note that scaling such a solution is quite easy and I may help you with such a project, but what I would like to ask you, and this question I have almost answered myself, is about the accuracy: how accurate is the solution, and maybe give me some numbers on it.
NICK HILLIARD: Okay. Well, so, we are tagging on source and destination MACs, so the information that is going out to an IXP participant feed is going to include those MAC addresses, and parallel to that, we can give a 100 percent accurate map of who uses those MACs at the Internet Exchange because, assuming the IXP has up‑to‑date data, I don't see any reason why it couldn't be 100 percent accurate. But then on the IXP participant side, it's up to the IXP participant to then do the analysis and to make the mapping between the MAC address and the source port that the spoofed packets are coming in from.
SPEAKER: Thank you very much. And thanks for your project.
NICK HILLIARD: Thank you.
STEVE NASH: Arbor Networks, as the collector. We talked about this problem some time ago, Nick, and I would love to work with you but we would need one of the members that is also an Arbor user to be interested in working together to make full use of this feed.
NICK HILLIARD: Great, cool, thank you.
SPEAKER: Paolo. On to the question: my reply to Erik from AMS‑IX is that it's good that scalability is a concern but it should never be a problem, so whenever AMS‑IX wants to speak to me I will be very glad to hear. Thanks.
NICK HILLIARD: Okay. Thank you very much.
(Applause)
FLORENCE LAVROFF: Thank you, Nick. The next presentation on our agenda is from Benedikt Rudolph, about BIRD route servers.
BENEDIKT RUDOLPH: Hello. I am from DE‑CIX, and I am in the research and development team there. Today I will talk about scaling BIRD route servers and our adventures in that field.
You probably all know BIRD, the routing daemon; it's quite popular and it can speak BGP, which is one of the main reasons why it is used in this community. It's written in C, it's open source software and, most important, it's widely deployed for route servers at IXPs. There are many mid‑size to small ones using it and even large ones like DE‑CIX are using BIRD.
It has a flexible configuration file syntax, you can configure it in almost any way you desire it to run. And it's known for reliable and stable operation.
So, why look into BIRD and make it scale? Well, we experienced some problems in operation and, most importantly, we want to avoid what I call the spiral of death syndrome. Not that we experienced it in our production set‑up, but in testing we discovered that when a large number of peers goes away, for example due to a physical link failure, or when one peer announces many prefixes at once, because BIRD is single‑threaded it takes a long time to reconverge. Also, during maintenance, we have processes in place to manually disable every protocol when we have a planned maintenance and then re‑enable it, in order to limit the load that the BIRD process has to cope with.
We also want to help to improve the situation, and to make BIRD scale onto the unused CPUs. A single‑threaded application just runs on one and there are three others idling on the same machine, so why not use those resources? And we may also reduce the load on the individual BIRD process when we distribute the peers among multiple processes, and in the end we may even serve more peers in total at the route server. And the solution that we plan should be easy to test, to deploy and to maintain. The clients should not have to do any alterations to their configuration, and we, as people looking into it, do not want to alter the BIRD source code as of now.
So, what is the exact problem? I want to rephrase it: We want to make multiple BIRD processes behave like a single one, like the configuration you already know. We want to share a common routing information base, a common master table, we want to calculate the same BGP best path as with the traditional set‑up, and we want to appear on BGP level exactly the same. And of course we want to share one IP address on multiple processes.
We also want to load balance the incoming BGP connections and share them among one, two or many processes.
So what are the building blocks of the proposed solution? We set up a private subnet on the machine running the route server with private IP addresses, this is one part of the solution. Then, we link the master tables among multiple BIRD processes, we need to do a full mesh there for reasons, and we use eBGP with the add‑path feature for the processes to talk to each other. We tried other things: we tried iBGP, but it fails in the best path selection because iBGP routes take lower precedence; we tried eBGP without add‑path, but then we experienced path hiding in some situations. And the third building block is that we want to balance the BGP peers based on their IP subnet, so we take the peering LAN, split it into smaller subnets, and assign those using destination NAT to the one public‑facing IP of the route server.
This is a slide that gives more detail on the load balancing. As I just said, we use destination NAT to make a mapping from the incoming BGP connection to the actual BIRD process in the private local ‑‑ private loopback network. Of course, there is some overhead in the set‑up. It's not a clean or nice solution but it's a practical solution. We have N copies of the master table and for each new best path we receive we send out N minus one updates to the other BIRD processes; this is because of the full mesh set‑up.
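The subnet-based balancing and the full-mesh overhead can be modelled in a few lines of Python. This is only a sketch of the mapping logic, under the assumption that the peering LAN is split into equal halves or quarters; the actual destination NAT would be done by the host's packet filter, and the function names and addresses below are hypothetical.

```python
import ipaddress


def assign_subnets(peering_lan, private_addrs):
    """Split the peering LAN into one equal subnet per BIRD process.

    The number of processes must be a power of two so the LAN splits evenly.
    """
    n = len(private_addrs)
    if n < 1 or n & (n - 1):
        raise ValueError("number of processes must be a power of two")
    lan = ipaddress.ip_network(peering_lan)
    prefixlen_diff = n.bit_length() - 1
    subnets = lan.subnets(prefixlen_diff=prefixlen_diff)
    return dict(zip(subnets, map(ipaddress.ip_address, private_addrs)))


def backend_for(peer_ip, mapping):
    """Return the private address of the process that serves this peer."""
    peer = ipaddress.ip_address(peer_ip)
    for subnet, backend in mapping.items():
        if peer in subnet:
            return backend
    raise LookupError(f"{peer_ip} is not inside the peering LAN")


def mesh_updates_per_best_path(n):
    """In a full mesh, each new best path is sent to the other n - 1 processes."""
    return n - 1
```

For example, with two processes a /24 peering LAN splits into two /25s, so peers in the lower half land on the first process and peers in the upper half on the second, and every new best path costs one extra internal update.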
So, now to the benchmarking or testing part. We used the BGPerf framework, which is a Python framework, and designed our test as a three‑step process. At first we do the initialisation step: we generate config files for the software parts that are involved in the later testing, so we use ExaBGP to simulate the peers, and we have a route collector or route monitor that monitors the BIRD instances. In the second step, we set up everything: we have three Docker containers, one for the route server under test, the so‑called target, one for the monitor and one for the tester, and in the tester the ExaBGP processes run. Then we execute the test: we bring up all the interfaces and during the test we log the time, CPU and memory usage. We repeat each test three times for one, two and four BIRD processes and for a varying number of peers. And yeah, this is basically the set‑up and now I will present you the results that we observed.
This slide shows one BIRD process. So, we have the time on the X axis and when we start the experiment, the CPU goes up as more peers are coming in. The diagonal lines show the amount of peers received by the BIRD process and the amount of memory, and the take‑away from this slide is basically that the experiment ends at the yellow line, when the route server has learned all prefixes. For one process and the stated number of peers, it takes 165 seconds. Now we add two processes to the set‑up and we see, with the same scale on the X axis, that we can reduce the time until all prefixes are learned by the route server a fair amount. It just takes 150 seconds with two BIRD processes. And when we add four processes to the set‑up, the time to learn all prefixes is reduced even further.
So, these are just three examples, and there are many more set‑ups that we ran, automated, and this is an overview of all the tests that we ran with one process. On the X axis we have the time again and on the Y axis we have ‑‑ no, wrong. On the X axis we have the number of prefixes in the current set‑up and on the Y axis we have the time it took to learn all the prefixes. So you see, for small set‑ups with a very low number of prefixes the results are quite inaccurate, they vary by a large amount, this is also visible, but as we ramp up the number of prefixes, the execution time follows a polynomial, so quadratic scaling, super‑linear scaling, and this is one of the problems with the set‑up: as you add more peers, the effort you take to do all the calculation also increases. The yellow line is a fitted curve that we put through the raw data and we see there is a polynomial relationship. The same for two processes, as I already told you ‑‑ we also see that the variation in results diminishes, we get more stable results with two processes. However, we cannot test the largest configuration due to a limit in memory. And we also see the quadratic relationship. And the same for four processes. We also observe the anomaly for small set‑ups and small numbers of peers but for the larger set‑ups we observe the same way of scaling. And with four processes obviously we consume even more RAM and cannot test the second largest set‑up.
I will sum up all the tests in an overview. You may be interested to see what the actual gain is of using the described set‑up. We compare the execution times of all the set‑ups we tested, so in this graph there are three different colours, one for one process, one for two BIRD processes and one for four. Let's set aside the small set‑ups, they are not in scope anyway because we want to look at how BIRD scales for large set‑ups. And when you look at the average execution times with two BIRD processes, you see they can be as much as 60% lower than using just one BIRD process. And it's consistent. So even if you look at the error bars you see the red bars are lower than the light blue ones. However, using four BIRD processes, due to the overhead there are situations where it is not more efficient than just using two processes.
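The fitted curve mentioned for these plots can be reproduced in spirit with an ordinary least-squares quadratic fit. The sketch below uses only the Python standard library and solves the 3x3 normal equations directly. It is an assumption that a plain quadratic fit was used for the slides, and the function name is hypothetical; no measurement data from the talk is included.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    # Power sums for the normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix; unknowns ordered (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))
```

Feeding it pairs of (number of prefixes, seconds to learn all prefixes) from repeated runs would give the coefficients of the curve; a clearly positive `a` is what "quadratic scaling" means here. For large prefix counts it helps to rescale the x values first (for example to thousands of prefixes) to keep the sums numerically well behaved.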
So, this is already my last slide, and I will try to sum it up for you:
We see that the multi‑process set‑up can improve the response under high load, but there is also a cost: the increased usage of resources. We have quite moderate resource usage with two processes, so if you remember the individual graphs, with two processes we use about 8 gigabytes of RAM and with four processes it doubles to about 16 gigabytes of RAM. And we have a speedup that we can observe, but as we take more and more BIRD processes this is equalized by the overhead, so in the speedup that we observed there is already overhead involved. For example, we need to replicate the contents of the master table and have an internal communication overhead. On the loopback we talk over TCP, so handshakes and so on, all that overhead is in the set‑up. You could probably improve on this, for example by using just plain Unix sockets for the communication of the local BIRD processes, they are running on the same machine anyway. And to conclude, I will also give some ideas for possible improvements:
There is a possibility to enhance BIRD using a simple multithreading model. As you know, parts of the configuration are very modular, the BGP protocol itself is a well‑defined software instance within BIRD and maybe you could split up that processing using multiple threads. Also, we would like to note that our peers are quite synthetic, so they all have the same amount of prefixes, and there are ways to improve the set‑up and make the peers more realistic, obviously. And this is also one thing that we want to do in the future. And if you have any feedback on that, how we could make this testing and the set‑up of the test environment more realistic, please talk to us. And we noticed that when we start the experiment, it all goes well, the prefixes come in and then the BIRD process takes a couple of minutes or a couple of seconds to wait. So the CPU usage drops for, like, ten seconds and, as far as it is visible to us, nothing happens, but then the experiment continues and eventually the processes learn all routes, so we need to look a bit more into what actually happens there.
And right now, all the benchmarks are executed on one machine using separate Docker containers, and a possible improvement for the future would be to shift the load generation part to another machine, yeah, but we made sure to provide enough resources, so we had a pretty big machine with 8 CPUs and 64 gigabytes of RAM and we never ran out of resources.
And we might even repeat those experiments with eight BIRD processes to see what the overhead will be.
Okay. This concludes my presentation. And I am happy to take questions.
(Applause)
REMCO VAN MOOK: All right. Any questions for Benedikt? I see people running towards microphones, this is very unusual.
SPEAKER: Peter Hessler. On your slide comparing the execution times, if you go back just one more to get rid of the highlight, yeah, I noticed a very interesting trend here: with a small number of prefixes it takes a long time, it kind of drifts down, then we have this weird spike in the middle and it ramps back up. Have you investigated what was happening there? Do you understand why? I am especially interested in that spike.
BENEDIKT RUDOLPH: Yes, the spike is an anomaly that we experienced in the set‑up with one process using that exact amount of peers and I would like to deduce it for you. If you look at the graphs here, you see after, I cannot read it here, I have to read it on the beamer, yes, after 50 seconds the process takes a break and then continues, and you even have that at the end of the experiment, and in the set‑up you mentioned with 250 peers, we experienced repeatedly that the process learns 90% of the peers and then just waits and then the last peers come in and the experiment ends. These results are repeated three times and probably there is a systematic error in it that will go away when we do more repetitions. So take this spike, and also the large variation which is indicated by the error bar, as an outlier. It is a trend, and it's consistent and I currently have no explanation for it either, other than the waiting phenomenon.
MARTIN LEVY: I am very aware of who is standing behind me but I will continue with my question. Two questions, actually, very simple one: In man‑hours, how long have you spent at this effort? Just as a rough guess.
BENEDIKT RUDOLPH: I joined DE‑CIX three months ago and this is ‑‑ this was one of the first projects that I started with. And a rough guess would be two or three weeks of continuous effort, but I wasn't able to work on the project continuously, and I also made use of BGPerf, an open source framework that is readily available on GitHub, and I just did the modifications to incorporate the multi‑BIRD set‑up into the benchmarking framework, and I modified that of course to get the results in a way that I can present them here.
MARTIN LEVY: And so my follow‑up question is, would that time have been better spent either working directly on the BIRD code to add multi‑process support, and I am still very aware of who is standing behind me, and/or is this amount of testing, directed at a different route server software package such as GoBGP, maybe a more interesting thing to do? And with that bombshell I will stand away from the mic.
BENEDIKT RUDOLPH: Yes, that is a very good idea to also test other route server software out there, and this is why I chose BGPerf, because it makes it easy to exchange the route server implementation tested and it is even capable of testing GoBGP, so it would be easy to repeat the set‑up with another route server implementation. But as I work for DE‑CIX, our focus is currently on getting to know the limits of our production set‑up and this is why I specifically investigated BIRD. And I see the point, the man‑hours that I spent could be ‑‑ could be invested into improving the actual BIRD source code and I, being a computer scientist, could do that; however, the focus was on producing this as of now.
SPEAKER: Ondrej Filip, I am you multithreaded but even though it was not intended to be so. Which version did you use, and did you configure the debug latency switch? Because I think many of the problems you saw, where the BIRD process paused and didn't do anything, would probably be visible using this switch and you could probably debug what is happening there, because I suspect there is some problem in computation.
BENEDIKT RUDOLPH: Okay. Great to know that. Sorry, I used BIRD version 1.6, but I am happy to repeat the experiments with a more recent BIRD version that just came out.
SPEAKER: It's just 1.6.2. There is no big deal.
REMCO VAN MOOK: I think we are going to have to cut it short because we are running out of time.
BENEDIKT RUDOLPH: We will take this off‑line.
REMCO VAN MOOK: Looking forward to future presentation. Thank you, Benedikt.
(Applause)
Next up is Leslie, some of you may be aware there was a panel during the plenary earlier this week which was about Internet Exchanges, trying to invade our territory, maybe. I have no idea. Maybe just enjoy the interconnection so much they also wanted to be part of the plenary. Anyway, Leslie set up the panel and she is here to summarise what happened, off you go.
LESLIE CARR: Yes, I did also realise I think most of the panelists are here. But, so I know that all of you were watching the entirety of the awesome plenary, right, so none of you missed this, this is just going to be going back over this. You did miss my slides. So this is my cat, and as you can see, she is sitting on little one dollar bills because a lot of small or, as some people have noticed, just less large Internet Exchanges usually have a little more restricted finances than some of the larger ones that we all know and love.
So, I was moderator. I am on the Board of Directors at SFMIX but since SFMIX is a very nonprofit, I also have a day job at Clover Health. And we also had Ulf from Swiss‑IX who I think is here in the room. Zoran from SOX, and Will from LONAP, and there we go.
So, one interesting ‑‑ so we talked about how our Internet Exchanges got started, and in three of the cases they were started as community based Internet Exchanges, and SOX is the one Internet Exchange that was actually started from the beginning as a for profit entity. One interesting thing is that when starting, most of these exchanges just started with borrowing, stealing, finding an old switch in the bottom of a cabinet. Really, one of the bigger challenges for Internet Exchanges isn't necessarily the technical side of the thing, right, because it's ‑‑ I mean don't get me wrong, it's not trivial to set up one, as many of us all know, there is a lot of work involved, but there is also a surprising amount of paperwork, administrative overhead and things like that, which are not quite as fun, and human hours are always restricted. Also, most European Internet Exchanges have private VLAN capabilities, something that many people did not realise, there were questions at the mic, and which is not the case in the US. And yeah, oh, sorry, I just completely lost my train of thought. VLANs, VLANs: obviously most Internet Exchanges don't want to get into the job of providing transport because we don't want to compete with many of our members or hosts, but in America exchanges don't offer private VLANs, where most exchanges do here in Europe. And yeah, do we have any questions?
REMCO VAN MOOK: I take it that none of you are asking questions right now because you already asked all the questions to the panel during the plenary, right?
LESLIE CARR: Exactly. Another interesting thing, I found at least, was that for the cost numbers we are talking about at these exchanges, cost is not that much of a factor once you get your members signed up. I am sure if everyone started charging €10,000 a month it might become more of an issue.
REMCO VAN MOOK: Okay. Thank you for your summary, Leslie.
(Applause)
Which takes us to almost the end of our agenda. Let's have a look at where we are. I received word yesterday that somebody might want to be interested in a lightning talk but that is the last thing I have seen of it. So, there might have been a lightning talk but there won't be this time. So, with that, we are coming to the close of this session. We are actually a few minutes early so you can jump the queue for lunch, I think that's great. Let's see. Any feedback from you directly right now: what do you think of this session, was it shit, was it good? What are subjects you'd like to see in the next sessions? Now is your time to shine. Or you can just drop us an e‑mail if you are asleep right now. No? Wow. Okay. In that case, I thank you all for your time. I thank the scribe for being diligent and the stenographer for typing everything down so I can read what people are saying. And we are done for today. I thank you all, enjoy lunch. And before I forget, because otherwise somebody is going to be really cross with me, during lunch there is a table for ‑‑ about RIPE Atlas, if you are interested in coding against RIPE Atlas, it's ‑‑ so there is a developer there who can help you with any questions, so if you want to do something like that please go find the Atlas table during lunch. That is it.
(Applause)
LIVE CAPTIONING BY AOIFE DOWNES RPR
DOYLE COURT REPORTERS LTD, DUBLIN IRELAND.
WWW.DCR.IE