Plenary session
25 October 2016
12 noon
CHAIR: If everybody would like to please take your seats, we would like to get started. And again, please try to fill in the rows, don't sit at the edges. And before we get started, let me remind you that when it's time for questions, do not press the button before we point to you and make sure that it's your turn.
Welcome. My name is Jelte Jansen, and together with Joao Damas, we will be chairing this next Plenary Session. And I can give you a long introduction, but I think he is very capable of doing it himself. Our first speaker is David Fernandez and he will be talking about the SoloWAN project.
DAVID FERNANDEZ: Thank you very much for the introduction. I would like to present a project we have been working on at the Technical University of Madrid for the past few years. This has been something related, as you will see, with Open Source WAN optimisation. The project has been hosted by the Center for Open Middleware, which is a joint research centre funded by a big banking company, Santander Bank in Spain, and it was aimed at the development of Open Source software in the context of banking applications. It was, as I told you, something between the Technical University of Madrid and Santander, and most of the projects in that research centre were related to banking, but there were also some projects, like SoloWAN, that were related to networking. So that's why I'm here presenting this idea.
This would be the contents of my presentation. I will briefly introduce some of the ideas of WAN optimisation techniques, concentrating on deduplication, which is the one that we have implemented in the software. Then I will present the ideas of the SoloWAN project and the activities we have developed, presenting also some new use case scenarios, and I will finish my presentation by talking about the virtual testbed scenarios, which I think is an interesting activity we have in the development of the project.
So, about WAN optimisation techniques. The idea is simple. As all of you know, WAN links normally have lower bandwidth, higher delays and higher cost than the bandwidth that we have in data centres or local area networks, so for some years companies, manufacturers and researchers have tried to investigate techniques in order to improve the communications over that WAN link. There are a lot of optimisation techniques, starting with the classic ones that consist of compressing the information that you send over that WAN link.
There are other techniques that try to optimise the protocols that run over those links. Typically, there have been studies about, for example, how to optimise the TCP protocol over these links, because with higher delays in the link the sliding window protocols do not work well. So optimising the traffic over this link means trying to improve the communications, either reducing the information that goes in the payload of the packet or trying to optimise a specific protocol. The latter is a technique that has to be tied to each protocol that runs over there.
In the case of reducing the information with compression techniques, applying the classic compression techniques is not good in this case, because applying a classic compression algorithm to the payload of one packet does not reduce the redundancy that you have in WAN communication. There are some more advanced algorithms, which are the ones I am going to present, based on deduplication, which is something that nowadays we find in a lot of modern file systems: they try to reduce the storage occupation by finding blocks of information which are duplicated and removing that duplication.
So the idea of the deduplication algorithm applied to the communications over the WAN link is that each time we send a packet over the WAN link, we have one optimisation device that will save that packet into a big dictionary, or cache, of packets that have been sent. After that, we will try to look for coincidences in the information that goes into new packets in order to try to reduce the information we send.
As you can see here, the idea is that if you find a coincidence in the information between these two packets, then, instead of sending the information that was previously sent and is stored in this dictionary, what we do is to send just a small block of information, just a hash. We reduce in this way the size of the packet that goes through the WAN link. Then, before delivering the packet to the destination, of course we have to do the opposite operation, which consists of recovering the original information from the dictionary here, copying it into the packet and then delivering it to the destination. That is basically the idea of the deduplication algorithms that are being used by commercial WAN optimisation products like the one I am going to present.
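To make the mechanism concrete, here is a minimal sketch of that dictionary idea in Python; the block size, fingerprint length and framing bytes are invented for the example and are not the actual SoloWAN or OpenNOP parameters.

```python
import hashlib

CHUNK = 64  # illustrative block size, not the real SoloWAN parameter


def optimise(payload: bytes, dictionary: dict) -> bytes:
    """Replace already-seen blocks with short hash references."""
    out = []
    for i in range(0, len(payload), CHUNK):
        block = payload[i:i + CHUNK]
        h = hashlib.sha1(block).digest()[:8]   # 8-byte fingerprint of the block
        if h in dictionary:                    # seen before: send a reference
            out.append(b"\x01" + h)
        else:                                  # first time: store it and send the literal
            dictionary[h] = block
            out.append(b"\x00" + block)
    return b"".join(out)


def restore(data: bytes, dictionary: dict) -> bytes:
    """Inverse operation on the far-end optimiser."""
    out, i = [], 0
    while i < len(data):
        tag, i = data[i], i + 1
        if tag == 1:                           # reference: look up the stored block
            out.append(dictionary[data[i:i + 8]])
            i += 8
        else:                                  # literal: store it for later references
            block = data[i:i + CHUNK]
            dictionary[hashlib.sha1(block).digest()[:8]] = block
            out.append(block)
            i += CHUNK
    return b"".join(out)
```

Both ends keep their dictionaries in sync because every literal block is stored on both sides before any later reference to it is sent.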
This, depending on the traffic, can produce an important reduction or not. This kind of algorithm depends on how redundant the information is. But the idea is that, doing things in this way, what we reduce is the redundancy that exists between all the packets of one communication or between the packets of different communications. This is something that applying compression directly to a packet cannot do.
So, this is basically the deduplication algorithm. It can be applied only to traffic that is not encrypted. I mean, if you have encrypted traffic, encryption normally randomises the traffic, so these algorithms are only efficient when you apply them to normal traffic.
This type of algorithm you find in several commercial products nowadays, which are very complex products because of what I said previously: optimising means including in the equipment optimisation techniques tied to each protocol, so you need to know each protocol in a very detailed way when applying these kinds of techniques. So the systems are expensive, and normally you have problems of elasticity, because these are systems that have a high cost in terms of equipment and licensing.
So, the initial idea that was transmitted to us when we began with the project was: what about an Open Source software-based WAN optimisation solution? That was the idea the company transmitted to us: this equipment is very expensive; we have a specific problem transmitting some protocols, FTP and HTTP, not encrypted; so we want to know if it is possible to do this kind of WAN optimisation, based on the deduplication algorithm, using Open Source software.
So that was the initial idea of the project, and with this idea we began to investigate. The initial objective of the project was just to study which solutions were available on the Internet, to create some virtual and real testing environments, and at the end to give some recommendations saying whether it is possible to do WAN optimisation using Open Source software.
Well, after investigating, we came to the conclusion that only three products were available. One of them we were not even able to set up and make run in the testing scenarios, and the other two were working, but they didn't include all the functionality that we were looking for. In particular, OpenNOP was working well, but it didn't include a deduplication algorithm, and WANProxy did have a deduplication algorithm, but it was not working very well.
Okay, so the answer was that we didn't have a mature solution. So what we had to do was to reorient the objective of the project, and we dedicated effort that was not foreseen at the beginning to developing and adding a deduplication algorithm inside one of the products that we found.
In particular, we started developing using the OpenNOP software. Basically, the reason that led us to choose this software was that it has a very well designed architecture, which you can see in this slide. This represents what happens inside a Linux router when you route an IP packet. Here in the kernel you have the packet routing: normally the packet comes in from one interface, here you have the routing tables and all the processing inside the router, and then you send it out over another interface.
So if you want to implement an optimisation device, what you need is just to steal the packet from the kernel in order to do the processing. OpenNOP has a very good architecture: it uses the netfilter library, which is in every Linux; it's a nice library that basically allows you to steal the packet and give it to a daemon which is running in user space, and that is where all the logic is implemented that takes the packet, consults the dictionary, compresses the packet, etc., etc.
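For context, the "steal the packet and hand it to a user-space daemon" pattern can be sketched like this. OpenNOP itself is written in C; this is just an illustration assuming the Python netfilterqueue binding and an iptables NFQUEUE rule.

```python
# Requires a rule such as: iptables -I FORWARD -j NFQUEUE --queue-num 1
from netfilterqueue import NetfilterQueue


def handle(pkt):
    raw = pkt.get_payload()      # full IP packet handed up from the kernel
    optimised = raw              # ... deduplicate / compress the payload here ...
    pkt.set_payload(optimised)   # hand the (possibly rewritten) packet back
    pkt.accept()                 # let the kernel continue routing it


nfq = NetfilterQueue()
nfq.bind(1, handle)              # queue number must match the iptables rule
try:
    nfq.run()
except KeyboardInterrupt:
    nfq.unbind()
```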
So we were brave enough to try to implement in OpenNOP the deduplication algorithm I mentioned.
We were not expecting to invent an algorithm, so what we did was just to study the references; we found a good algorithm and we tried to implement it. I'm not going into the details of the algorithm because I don't have time. Just to say that the main problem of this kind of algorithm is what is represented here: you have a dictionary full of thousands of packets, and what you have to do when you receive a new packet is to find coincidences between this packet and the packets that you have stored in the dictionary. And this is a very difficult task, because maybe you could have two packets that are almost the same but the content is shifted by one byte. All these kinds of algorithms are very computationally intensive, so you have to use techniques like the ones presented here, which consist basically of not comparing the full packet but choosing some blocks of information, in our case 32 blocks of information, and trying to compare them to what is in the dictionary.
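As a rough sketch of that block-selection idea (the window size, the bit mask and the use of CRC32 are assumptions for illustration, not the algorithm actually implemented in SoloWAN):

```python
import zlib


def fingerprints(payload: bytes, count: int = 32, window: int = 12):
    """Pick up to `count` anchor positions whose windowed hash ends in zero
    bits, so both optimisers select the same anchors without comparing
    whole packets byte by byte."""
    anchors = []
    for i in range(max(len(payload) - window, 0)):
        h = zlib.crc32(payload[i:i + window])
        if h & 0x3F == 0:              # roughly 1 in 64 positions qualifies
            anchors.append((i, h))     # these get looked up in the packet dictionary
            if len(anchors) == count:
                break
    return anchors
```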
This is basically the idea of what we implemented. Here you have some characteristics of what we have implemented: it includes the original code of the OpenNOP software; we implemented the deduplication algorithm and we also included the possibility to apply both algorithms, that is, to apply deduplication to the packet and then apply compression to the rest of the packet. That gives a good result. This was released in November two years ago, and you can find here the repository where it is available.
We included some virtual machines and we also developed some Wireshark plug-ins in order to help debugging.
What are the scenarios where it is used? This kind of software can be used in the two scenarios you can see here. The classic one is the Optimizer-in-the-Network, which consists of putting the Optimizer in a place where all the traffic between your sites goes before going through the WAN.
But also, because they asked for it, there is the possibility to include the deduplication algorithm inside the host, in order to deduplicate the traffic that is being sent by some applications inside that host. That was the idea.
Normally, in the first case, you get better results because you aggregate more traffic, and the more traffic you aggregate, the more redundancy you can reduce. But in some cases it's useful to have it inside the host. In this case, we provide the possibility to run it as a classic daemon or even as a Docker container.
Here you have some SoloWAN tests. Unfortunately, it was not possible to do extensive tests in production environments and we have only done some laboratory tests. But here you can see a comparison with some other software, Wanos, which became available later, not at the beginning, and it's not Open Source, it's a commercial product. So here you have a comparison sending one of the benchmark file sets, which is a set of files that represents the typical files that we send over the network. This is a test sending that set of files with no optimisation, using Wanos, and using our software. This is the first time you send it, and of course the second time you send the same information you have a very big reduction due to the deduplication algorithm.
So this is not suitable for encrypted traffic; these kinds of devices have to be put in a place where the traffic is not encrypted, or maybe you have to break the end-to-end encryption to do that.
Okay, more ideas. We have developed a web application interface in which you have the typical view where the lower part is the traffic you have on the WAN and the upper part is the traffic you have on the LAN. So you see the difference that the deduplication algorithm makes.
More ideas, just to tell you: there was an article in a German journal that compared all these kinds of software, and the results were very good. They highlighted that SoloWAN was easy to install and that the compression grade was better than the other ones.
More ideas: Just going to ‑‑ I will skip this scenario.
We also tried to improve things in order to give scalability to this solution. This kind of software has a limit in the computational power of the CPU that you have, and in case you want to increase the throughput that you get from these systems, what you can do is use Cloud computing paradigms. This is something we began to investigate in the project, and what we ended up designing and developing as a prototype was the kind of architecture that you can see here, in which we have replaced the Optimizer by a set of Optimizers with a load balancer: this router just redirects the traffic to the load balancer, which balances the traffic over a set of different Optimizers. This is an architecture we have tested that improves the result and mainly gives scalability to the solution. If you need a more powerful Optimizer, you just get it by adding new servers. So this is the idea of the scalable architecture.
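One way to picture the balancing step: hash on the communicating pair so that a given pair of sites always lands on the same Optimizer, and therefore on the same deduplication dictionary. This is only a sketch of the idea; the fields and hash used by the actual prototype are not stated in the talk.

```python
import zlib


def pick_optimizer(src_ip: str, dst_ip: str, optimizers: list) -> str:
    """Hash the address pair so a given site pair always hits the same
    optimiser instance and therefore the same deduplication dictionary."""
    key = f"{src_ip}-{dst_ip}".encode()
    return optimizers[zlib.crc32(key) % len(optimizers)]


# Example: pick_optimizer("10.0.0.1", "192.0.2.10", ["opt1", "opt2", "opt3"])
```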
More ideas we developed. The final task that we addressed inside the project was trying to integrate this kind of Optimizer inside the Cloud. This is the very big trend that all of you know, the integration of network functions into the Cloud, creating network function virtualisation clouds specialised in network functions. So the idea that we followed in this case was to try to integrate this Optimizer inside an OpenStack Cloud. That was the idea, and the concept that we named WAN optimisation as a service. We did this, more or less, by trying to reproduce what is in OpenStack. I will tell you briefly.
This is a classic scenario of an OpenStack infrastructure. Here you can see the control node, the one that orchestrates everything; you have a network node, which handles the communications for all the virtual machines; and you have the compute nodes that host the virtual machines. The idea of an OpenStack cloud is that you deploy virtual machines: the client that comes to this infrastructure wants virtual machines to be deployed. These virtual machines have to be connected through a virtual network, and, using the virtual bridges which are inside the compute nodes, you connect the virtual machines through the virtual network; the network node has the task of providing the communication from the virtual machines to the Internet.
So Openstack provides this service, the router and the NAT.
So, the idea that we followed is to integrate this kind of service as it is done in OpenStack. The firewall, as I said, is a service that you can deploy directly as a client of an OpenStack architecture; you can also deploy a load balancer; and now, with our prototypes, you can also deploy WAN optimisation as a service. That was the idea. So we worked on that and we set up some testing scenarios in order to test whether this idea is feasible. We needed a two-Cloud environment connected through a GRE tunnel, and then we deployed the basic scenario that I have shown you inside the two Cloud set-ups and, more or less, demonstrated that this kind of optimisation can be deployed as a service inside OpenStack.

And just to finish, although I have more slides, I would like to say that, unfortunately, we don't have a big infrastructure in order to test all these set-ups, so in this case virtualisation came to our rescue: in order to set up testing scenarios, it's a very good idea to just deploy them as virtual testbeds. You cannot test performance over this virtual testbed, but you can test the functionality, and most of the testbeds that we have used to develop this software were based on virtual scenarios. So in this case all these scenarios have been virtualised, and all this set-up was tested over two standard laptops that were connected using a switch. In this case you can see we have double virtualisation, because all the nodes of the OpenStack are started as virtual nodes and the virtual machines that you start inside the compute nodes are run inside a host. All these scenarios have been developed using a software that is referenced here, VNX, which is open software, and you have a reference in order to use these scenarios; this is a reference to the OpenStack scenario, just in case you are interested.
So, to end the presentation, I give you some references. There should be a conclusion slide, but the conclusion is that the software is hosted on those repositories, so if you are interested and you want to contact us, we are ready. So thank you for your attention, and sorry for the few minutes' delay.
(Applause)
CHAIR: Thank you, David. Are there any questions? Are you all going to use this now?
SPEAKER: As you said, this is only valid for non-encrypted traffic, so do you have any statistics about how much of the current Internet traffic is encrypted versus non-encrypted, to assess the interest of this kind of solution?
DAVID FERNANDEZ: No, I don't have figures on that, but the idea is that this kind of equipment is useful when you put it inside your infrastructure. I mean, you put it at the output of your company, and normally inside your own company network you don't use encrypted traffic. So what you have to do is adapt your network in order to do the encryption from that point outwards. In the case of a data centre, it's clear that the infrastructure inside is a safe place, so you don't have to use some types of encryption inside. But I don't have those figures. This is one of the main problems of this kind of device: encryption. What you can do is break the encryption, but this is a big problem, as you have to give the keys to that kind of device. But some products can do it.
CHAIR: Any other questions? David, thank you.
(Applause)
Our next presentation is from James Quinn. He works for this little site called Facebook and I expect it's going to be very big some day. So he is going to talk to us about scaling.
JAMES QUINN: Hello. I'm James Quinn. I am a network engineer at Facebook and I am here today to talk about how we built our edge networks in Facebook. I'd like to give you a sense today not just of the technology that we build, but also a sense of the culture and our approach to problem solving.
Of course, Facebook is huge. Everybody in this room knows that. Once you pass a billion users, you know that the scale of the networks that we have to build to support that service is enormous. But what I'd like you to see in this graph are these little strands of light. Every strand represents a connection between people. And particularly see how bright this graph is all around the world. More than 80% of the people who use our services are outside the United States. They are all around the world. And ultimately, all of them connected through our edge network.
Now, as I am sure the people in this room can appreciate only too well, networks are easily taken for granted. They are a lot like oxygen. When it's there, you don't even think about it. But the moment that it's not, you feel it immediately. Networks in general, and particularly edge networks at Facebook are a lot like that. When they are not where we need them, when there's not enough of them, we are strangling our services.
So what is that service that we are providing through our edge networks? At Facebook, a part of it is similar to most content delivery networks: it's the static content, the photos and videos that are cacheable, that you are familiar with. There is also a part that is different and unique to our service, the dynamic component that's continuously changing: status updates, comments, every time you click a like button. That dynamic content is something that is continually changing and needs to be consistent for everyone who uses our products all around the world.
Maybe there's a story that you really like and you want to share it with your friends. Or maybe not... but assuming you do, whether you are clicking a like button on your friend's post about their new job, or love on their picture of their new child, or expressing sadness over the loss of a good friend, fundamentally what we provide is a communication medium. And that communication needs to be true, it needs to be right for the people we serve all around the world.
So, I'd like to take you on a journey down memory lane. Five years ago our service, our network were very different than what we are today. If you connected to Facebook, your TCP connection would actually go all the way back to a server in the United States, in a few data centres. If you were in the US that was kind of acceptable. But as you can see in this picture here, for much of the world, the latency to our service was terrible. To dig into what that actually means. Let's take an example of a user in Asia connecting to a data centre in Oregon. The round trip time for TCP might only be 150 milliseconds but when you break that out to a full connection SSL, bringing down the web page ‑‑ actually your first http get, it's more than half a second. Extrapolate that out to rendering a full web page, that becomes an unacceptable experience, you are not going to like that service.
So we needed to build not just edge POPs, not just edge routers, but we needed to build edge servers, to terminate those TCP connections closer to the people who use our products.
By doing that, we dramatically improved the performance. Yes, that dynamic content still needs to be pulled from our data centres, it still needs to be synchronised globally, but it's pulled now through always-on connections from those edge servers. So ultimately the latency to actually deliver content to users is far better than it was. What does this mean from the network perspective? We started with simple edge network connectivity, building routers and edge POPs to connect to networks around the world. We needed servers to terminate these TCP connections closer to the people who used our products, and this pretty much solved the basic problem of the Facebook landing page rendering well. But our service continued to change. We had more multimedia content, rich photos and videos; we needed a lot more servers at our edge to cache all this content. We needed to build Clos fabrics and full edge server clusters to support all of this east-west and north-south traffic, and good that we did: we were ready for the changes in our service that we didn't expect. I still remember the weekend when a software engineer turned on the auto-play feature, and when you scrolled through your Facebook feed you began to see the videos playing automatically. That had a real impact on our network. And good that we built something like this to be ready for it. But our service is always changing.
And today more than ever, video is growing. You may have seen the recent launches over the last year of our live products, whether it was Mark's broadcast with the space station or the three recent US presidential debates. This is a fundamentally different type of content, in that hundreds of thousands or perhaps millions of people can look at the same rich content at the same moment in time. That means it's very spiky; its impact on the network is very different from the previous content we served.
And that means we are building different types of networks today. We are building multi-layer topologies, of course 100 gigabit everywhere. We needed to move from the old model of single-router God boxes connecting our internal backbone and our servers, to disaggregated functions: simpler roles, layers where we can grow quickly with simpler devices. And at the same time we needed to reconsider what we do from a protocol perspective. So over the next few slides I'll walk through the evolution of our software stack and how, in many ways, it allowed us to move beyond the simple limitations of BGP.
Now, of course the world runs on BGP. Most networks load balance and distribute their traffic using the mechanisms of BGP. And we were no different; our network originally looked much the same. But it didn't solve all the problems that we have. Take, for example: you might be a user in Africa and your shortest AS path to Facebook is to a POP in Japan. You know that that's not right, but BGP has no mechanism to document that, to prove it. So we built a system we called Sonar. What it allows us to do is measure the literal service closeness of networks around the world to our edge clusters around the world. And it does this in a way that's probably invisible to all of you, but a very large number of profile pictures on Facebook are served from POPs that we expect to perform poorly. We do this because services are always changing, networks are always changing, and we want to have a continually updated data set. What this data set represents is the latency and performance from nearly every network in the world where people connect to our services to nearly every edge cluster in the world.

And this data set is one of the inputs to our global controller system. Our global controller is aware of a lot of different metrics and data on our network. It monitors every single peering interface on every peering router in the world. It pulls BGP tables from every peering router. It's aware of the health of our servers and our networks, and most importantly it's aware of the Sonar data measuring the performance and closeness of networks to our servers. From this it builds a DNS map which allows us to distribute the ingress load, by giving different DNS resolutions to resolvers around the world and pulling traffic to different POPs.
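A toy version of that mapping step might look like the following; it is purely illustrative, with invented data structures, and the real controller weighs many more inputs than latency and headroom.

```python
def build_dns_map(latency, load, capacity):
    """latency[resolver][pop] -> ms, load[pop] -> current Gbps, capacity[pop] -> max Gbps.
    Greedily send each resolver to its fastest POP that still has headroom."""
    dns_map = {}
    for resolver, per_pop in latency.items():
        for pop in sorted(per_pop, key=per_pop.get):          # closest POP first
            if load[pop] < 0.9 * capacity[pop]:                # keep a safety margin
                dns_map[resolver] = pop
                break
        else:
            dns_map[resolver] = min(per_pop, key=per_pop.get)  # all full: fall back to closest
    return dns_map
```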
Now, we're all network engineers, we love graphs. What does this actually mean? The red line in this graph shows what our network looked like in a POP before we had this global controller system. BGP would simply pull in traffic, and as users woke up and came online in the region we could exceed the load a POP could handle. What that meant was that the performance for all the users coming into that POP was degraded. It's like a highway: when too many cars get on it at once, traffic grinds to a halt, people can't move, it's bad for everybody. But by being able to shift this DNS map and redistribute portions of users to other POPs with more lanes, more highway, more room for capacity, we are able to create this green line where we control the utilisation, we don't exceed the capabilities of our systems and everybody has a better experience. If you break this out at a regional level, you see that as users in that region wake up, all of those POPs might reach their capacity and we can begin to bleed traffic away. Globally, you see intersecting diurnal patterns as different regions come online and wake up, other regions go to sleep, and we shift traffic around to where we have the capacity to serve it well. What does this mean for all of you? Your cat videos load faster.
Okay. So, how does this global targeting system we have interoperate with BGP? Originally it did not. Originally we just treated BGP as it was: BGP did its thing, managing our egress, and this new system managed how we ingressed traffic to our network, and it kind of worked okay. It could mean, though, that we would pull traffic, say, into our POP in Europe and, because of our fixed BGP policies, we might push that out to a peer on the other side of the world. Eventually, this came to be a problem for us, in the sense that our Sonar data set, with its measurement of closeness, had evolved to the point that we had an opportunity not just to influence how we ingressed traffic to our network, but also to influence how we egressed traffic, in a way that could improve performance for our network, our peers, and the people who use our services.

This became particularly acute during a very serious outage, the sort of outage that you don't plan for: multiple diverse fibre paths cut on the same day in a significant region for us. Now, we had the opportunity with our global controller to shift our DNS map to another region with capacity to serve this traffic pretty well, but the problem was these fixed BGP policies pulling this traffic across our broken backbone to the same peering links, degrading the performance of our service that much more. We needed a different paradigm, a different design. We needed to confine BGP to smaller islands, city-level islands. Now, this didn't happen in one step. It took us time to develop our global controller software to do the right things. But ultimately it allowed us to control, with our DNS maps, not just where we ingress traffic to our network but also where we egress traffic from our network. Slowly finding the sticking points, finding errors in the software, accidentally draining peers we didn't want to drain, making that software better, slowly confining the domain of BGP and expanding the power of our controller, iterating, learning, we were able to make a system that's far better for us today, where we can have far more influence on the performance of our service.
Now, even containing BGP to these little islands, BGP still is a bit of a problem for us. It's limiting. We care about capacity, latency; we fundamentally care about the performance of our services. BGP cares about the attributes that you all know about. Now, you can statically encode those measurements of performance into BGP attributes, but that's just a measurement of one point in time. Networks change, services change, and whatever you encode will grow out of date. So we needed to build a dynamic mechanism: a local network controller.
Now, over the next few slides I'll walk through things we tried. The iterations we went through to get to where we are today.
Like a lot of people, we started with service routing. Server routing. Make decisions about where we forward packets down in our server stack. It looks fantastic on slides, it's a great idea, complete control, you can do anything. But it was not so practical for us in the real world. It was not so good for solving real problems.
For one, you have created entirely new problems for yourself to manage. You have to synchronise these decisions across servers, in our case a lot of servers. And what if your services change? What if your network changes? What if traffic is coming from someplace you didn't expect or plan for? And even if you solve all of these problems, and we never fully did, you still have the most basic problem: how do you signal from these servers to your network layer how to forward the traffic? How do you encapsulate this traffic? We tried a few different things. We started with MPLS. Two basic problems: kernel support wasn't there at the time for our servers, and you need to support this MPLS layer in every layer of your network, top-of-rack switches, Clos-fabric switches, all the way up to your peering routers, and our platforms at the moment didn't support this. You could try policy based routing or filter based forwarding, depending on your vendor of choice; this is where we spent the most time. The first basic problem is you have now shoehorned yourself into a vendor-specific feature set. You now have to manage router configuration state for these source routing decisions and keep it synchronised with these powerful software systems you built on your servers. That's not so easy to do. We had problems with that.
If you go with this approach, what bits do you use in your IP headers for these PBR decisions? You could use DSCP; we spent a lot of time with this. It works, but you have a very coarse address range, you can only make so many distinct decisions with it. We worked towards using the key value in GRE headers, a very large address space, but by the time we had built this software, our service had changed. We had new services with very small packet sizes that went to very high data rates as we hit our peak load. In the first two weeks when we deployed this, we literally crashed our two largest POPs in the world. Now, I'm sure everybody in this room has had this experience. You have this problem, and you have an idea, a feature or architecture or something that will solve it. And you do all the hard work to bring this into your network, only to get there and deploy it and it's like, oh dear, it didn't work, it didn't solve the problem. At Facebook, we're no different. We try a lot of things that do not work. If there is a difference, it's that we expect to make mistakes as a part of our process. We try hard to fail fast, to learn, to iterate, to work towards a solution that will be better.
What did we learn here? If we learned anything, it's to keep it simple. Now, what is the most simple and irreducible element of any edge network? Of course it's BGP. So we needed to find a way to evolve beyond the limitations of BGP by using BGP.
Now, what does this mean? The most basic problem: you have a peering router, you have a peer, and you have too much traffic on this link. Your BGP attributes are the best ones, local preference, AS path; this is the link this traffic needs to go down, but there's too much of it and you are dropping traffic on the floor. But we have data. We have BMP exports of our full tables. We have NetFlow data telling us how much utilisation we have for every single prefix, and we have counters for interfaces showing us the capacity, utilisation and packet drops. All of this data can go into our software, and our software can make a better decision: where should this traffic go, not limited by simple BGP attributes? And here is the key: it implements this decision as a literal BGP route injected into our peering router's route table, off-loading enough traffic to alleviate the overload condition. No traffic is dropped on the floor; happy links, happy services, happy users.
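The decision itself can be sketched as a simple greedy selection; the data structures here are invented stand-ins for the BMP, NetFlow and counter feeds James describes, not Facebook's actual code.

```python
def prefixes_to_offload(per_prefix_bps, link_bps, link_capacity_bps):
    """Pick a small set of prefixes whose traffic, moved elsewhere,
    brings the link back under capacity. Largest flows first."""
    excess = link_bps - link_capacity_bps
    moved, offload = 0, []
    for prefix, bps in sorted(per_prefix_bps.items(), key=lambda kv: -kv[1]):
        if moved >= excess:
            break
        offload.append(prefix)
        moved += bps
    return offload  # each prefix then gets an overriding BGP route injected
```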
But you might have noticed in that last picture there was just one peering router. That was quite deliberate. The original iteration of this service was super simple: that one peering router God box with the backbone connection, the server aggregation and lots of different peers. At that time our network was simpler; we could expect that a peering router would have diverse peering connectivity, public exchanges, private connections, transit, somewhere to send this traffic to. But if you think back to that multi-layer edge topology I showed earlier, our network is not so simple today. We don't build a peering router; we build a peering layer. And we can't expect that the next best peer, diverse peers, will all land on the same physical box. So we needed to move from a router model to a city model, a metro model, managing an entire peering layer as one unit. The mechanisms are not very different. The data is the same and the BGP injection is the same. What's different is we now leak this within a metro, so that we can pull traffic between any peering links in that metro, across peering routers. Now you might think at this point: awesome, you made all these mistakes, you worked through all of this, it's got to be done, right? Of course, nothing is ever perfect, nothing is ever done. There are always new services, new problems, something else to solve. In this case it's no different. I'll walk you through a few examples of problems we had even after we moved to this model.
For one, what happens if you are an IPv6 rich network, Facebook certainly is, and you are connecting to another IPv6 network. IPv6 is extremely summarisable. You can represent an entire IPv6 network in one prefix.
Now, if you offload that prefix, you haven't alleviated an overload condition, you have drained your physical link, which is not what we wanted to do. So we built a system to create sub-prefixes, where if that link becomes overloaded we can offload one of those sub-prefixes to a better peering link using the exact same mechanisms we used earlier.
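The sub-prefix trick itself is easy to illustrate with Python's ipaddress module; the prefix lengths are arbitrary examples, not the split Facebook actually uses.

```python
import ipaddress

# One announcement covering a whole peer network...
aggregate = ipaddress.ip_network("2001:db8::/32")

# ...split into four more-specifics so a single one can be steered elsewhere
# without draining the entire link.
sub_prefixes = list(aggregate.subnets(new_prefix=34))
# [2001:db8::/34, 2001:db8:4000::/34, 2001:db8:8000::/34, 2001:db8:c000::/34]
```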
Next problem. If you think back to the global controller I showed you earlier and the local controller I have shown you more recently, both of them had visibility into physical links and BGP tables, and that local controller actually influenced the BGP table. You might have thought: how did this work? How do these systems work together? Initially they were ships in the night; they were unaware of each other. That caused us real problems eventually. A link might become overloaded. The local controller moves fast; it injects a BGP route and moves traffic away. The global controller moves slowly; it's managing DNS, and DNS is slow, but eventually it moves traffic away from the POP and from this link. The local controller sees this, the link is underloaded, there is not enough traffic there, so it quickly injects the BGP route back and moves traffic back to the link. The global controller saw the same problem, but DNS is slower still, and you have overloaded the link again, and this could happen all day long. What did we learn from this? You need clear roles for your controllers: the local controller managing just traffic between links in a local metro, and only when that metro is exhausted do you engage the global controller to shift the DNS map.

But even within one controller, you can have oscillations like this. In the example here, the local network controller is managing traffic between two peering links and, as you see in the graphic, it's quickly oscillating traffic between them. Now, when we dug into this, the controller thought it was doing the right things, but the data was wrong. That BMP data, the counter data, the NetFlow that I talked about earlier, was collected globally in our data centres; it was built at a time when it was used for human consumption: a human being would pull up a graph at some later point and look at it. It wasn't built with the idea of realtime automation systems moving traffic on a live network. So we needed to shorten this pipeline: move our collection into the local POPs around the world, collect it close to the routers, close to the controllers, more precise, more granular, feeding fast, accurate data to our controllers.
And with that, we ended up with the picture on the right. You will never have a flat line in a real network. But we don't have the oscillations we had before.
Bringing it all together: a global controller feeding traffic to our POPs, and within those POPs multi-layer topologies managed by our local controllers. We are not done; we have much work to do. Increasing the modularity of our network, the resilience in our layers, the diversity of our systems and networks so we can work around failures faster. The programmability of our network: you might have heard about Open/R, with applications in wireless mesh networks and our internal backbone; we believe it has an application in edge networks as well. Moving away from the old paradigm of protocols and begging vendors for features, to an open model where we can iterate, try new things, and marry it to our software systems. Last but not least, performance. I talked earlier about Sonar, but we have much more performance data we believe we can use. Of course, an edge network is not like a data centre. There's a lot of noise on the Internet. It will never be simple. But we believe we can pull signal from that noise and marry our programmable network with our software systems to deliver a better service for our users.
Now, if you take anything from today's talk, I hope it's that we did not have all the answers. We did not know where we would end up when we began. We had much to learn. We started small, we tried things, we iterated, we learned. If there is one constant, I hope it's our values. We value operations, practical experience, trying things out and seeing how they work, over abstract features and perfect fail-safe architectures. But we always have much to learn.
Now, I'm sure many of the people in this room have problems similar to ours. And you may have different approaches. In fact, I hope that you do. And if you do, please seek us out. We would love to talk. Thank you very much.
(Applause)
Any questions?
CHAIR: If somebody is waving, please wave harder! I see a question over there.
SPEAKER: Thanks for your presentation. I am Francois, from ANSSI. When you presented the first slide, the architecture, with the DC, the POP and the customer, you showed something that takes place between the customer and the POP and stops there. And I am asking myself: is the traffic encrypted between the POP and the DC?
JAMES QUINN: Ah! We use a lot of encryption, but I couldn't speak to all of our applications, I'm not familiar with all of that stuff. So it would be hard for me to speak to that.
SEBASTIAN: It seems your process has been iterative, so you have been going on and changing things and improving things. Are you considering any level of machine learning for this, to take the decisions or some of the decisions?
JAMES QUINN: As a matter of fact, yeah, that is one of the things that's under consideration, particularly as the performance data that we are bringing into our systems becomes more complex. In the long range, we do want to bring more intelligence into how we make these decisions. Today, our software systems are not that complex. The data that we're operating on is fairly simplified, deliberately, but in the future we believe that will not be the case and that may have benefit.
SPEAKER: Do you use any prediction algorithms for shifting traffic between edges, or does it just react as traffic shifts? If yes, which kind of algorithms? Thank you.
JAMES QUINN: I'm not sure what you mean by prediction algorithms, but we're using live data from the network which means user requests are coming in, we see traffic showing up on interfaces. It's reactive. We see data and we react to that data. It's not a prediction thing, it's showing up and we are reacting to it, if that makes sense.
CHAIR: No more questions? Then, thank you again, James.
(Applause)
And the final speaker of the session is Francois Contat.
FRANCOIS CONTAT: Good morning everyone. This is the name of my agency. In English, it's the French National Agency for Information Systems Security.
Okay. So, we are a public service. Being that, we are often called by small ISPs or small transit providers that are working in France; they are facing DDoSes and they ask us what they can do in order to mitigate them.
Because we had a lot of questions, we decided to make a guide, a DDoS guide, explaining what a DDoS is and what kinds of solutions exist in order to mitigate it.
And when we wrote this document, we decided to go deeper and learn about all the techniques that exist.
When you are facing a DDoS you have two ways to react to it. You can do nothing yourself and ask someone else to do something for you, like your transit provider or a DDoS mitigation company, which announces the /24 of your attacked prefix, cleans the traffic and sends it back to your service, so the traffic is clean, it's okay. Or you can cope with the problem yourself. And this is where this presentation takes place.
Your situation, in case you want to deal with the DDoS yourself, is that you have the DDoSes attacking your own backbone and your hosted servers. You are a backbone admin; what I mean by that is that you have an AS number, you have prefixes, and you have services that are hosted, like web servers or gaming servers and so on. Because you want to deal with the DDoS yourself, you must have enough transit bandwidth in order to receive the good and the bad traffic.
You have services that are down and you want to bring them up while the DDoS is still coming. So that's why the RTBH solution in that case is not acceptable for you.
What I want to present to you is a DIY solution, which is kind of cool.
But before going to my solution, let's have a look at what exists. Today, when you are facing a DDoS you can put ACLs on your routers. Okay. As you saw with the Facebook presentation, there are a lot of routers in a network. So putting ACLs in the CLI on all the routers can take a lot of time, you may forget equipment and then create some holes in your backbone. And when the DDoS is finished you have to remove them. So ACLs are not that good for dealing with DDoS mitigation.
Also, you can use FlowSpec. It's pretty great, but you have to have routers that are compatible with FlowSpec, you have to install FlowSpec in your own backbone, and FlowSpec is not only layer 3: you can put rules that go up to layer 4, and it can be costly regarding the CPU for the analysis. Also, with these solutions you have a limit with each vendor: with FlowSpec, depending on the vendor and type of equipment, you will have a limit of 5, 10, 15k entries, so it can be a problem. And finally you have the vendor-specific solution. It's a black box you install in your backbone. You have an attack, you divert the traffic through the black box, the black box washes the packets and then you have clean packets that come through your system. It's okay, it works. But the main problem here is that it's costly. You must buy the equipment and install it in your backbone, and then if you want to handle a big amount of traffic you have to put a lot of money on the table.
I'm working for the State, so I have no money. That's why I have been working on a solution that is not costly.
So, these are the classic solutions, but another one exists and a lot of people forget about it. When you look at ACLs and FlowSpec, what are they trying to force the router to do? Filtering. And is a router good at firewalling, at packet inspection and so on? No, I don't think so. What does a router do well? Routing. Yeah.
So, let's go back to routing and do the filtering with it: URPF strikes again. And it's kind of simple. We simply use URPF in order to do the filtering.
So, I have only one router. You have the Internet and a router which is connected to a transit provider. We have a backbone and a web server, and I have only one customer. We have the red and the blue prefixes, the bad one and the good one. The blue one reaches the web server and accesses web pages; it's legitimate, everything is okay. Then the bad one arrives and sends a lot of TCP things to my web server. It's under attack. I don't like that. So what I do is simply set URPF up here on the upstream interface; as long as I have URPF and only one interconnection, it works. And then I simply add the red prefix with next hop Null0 on the router, and then URPF does the magic. The attack is mitigated big time.
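A hedged sketch of why this works: loose-mode URPF only asks whether the packet's source address resolves to a usable route, so pointing the attacker's prefix at Null0 makes that lookup fail. The addresses below are documentation examples, not the prefixes from the slides.

```python
import ipaddress

# Simplified routing table: prefix -> next hop (None models Null0 / discard)
rib = {
    ipaddress.ip_network("0.0.0.0/0"): "upstream",
    ipaddress.ip_network("198.51.100.0/24"): None,   # attacking prefix blackholed
}


def urpf_loose_pass(src: str) -> bool:
    """Accept the packet only if the longest match for its source is a real next hop."""
    addr = ipaddress.ip_address(src)
    matches = [n for n in rib if addr in n]
    best = max(matches, key=lambda n: n.prefixlen)
    return rib[best] is not None


# urpf_loose_pass("203.0.113.7")  -> True  (falls back to the default route)
# urpf_loose_pass("198.51.100.9") -> False (source routed to Null0, packet dropped)
```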
Let's summarise what we saw. We have an attack prefix; we just have to advertise that prefix to Null0. Okay, that's something we can do. We have line-rate performance, because it's routing, so no issue there. And in case you are multihomed, you just have to put the loose mode on your URPF and it will work. Of course, you have to check with the vendor that the Null0 next hop on the prefix will work with loose mode; you just have to look at the data sheets. But there is a "but"...
Maybe the red prefix has been spoofed by someone else on the Internet, and what they want is to block the access to your backbone for those customers. So, okay, you may drop customer traffic, and that's something you do not want. So it's a good solution, but it's absolutely not ideal. Okay, we have some good ideas there. Now, let's have a look at how the vendor-specific solution works. We have the Internet, the same prefixes, the good and the bad one, and they are all accessing the web server. We are under a DDoS and we want to mitigate. We plug the black box into the transit router, it learns the traffic and we divert the traffic through it: we just announce the /32 of the web server and redirect the traffic through the black box and on to the web server. Everything is okay. Then it does its magic, it scrubs the traffic. Okay, great. But there are too many issues there.
It is a voodoo issue ‑‑ sorry, it is voodoo magic. We do not know how the black box works. And I like Open Source systems, I like to know how things are done. So that's why I'm not a big fan of that, even if it is working.
And finally, the capacity depends on the money you put on the table and the amount of bandwidth you can cope with.
So, we saw two different solutions with great ideas. And you know what? What about merging both solutions and ending up with something else? I call it ABH.
So, same architecture. I have some money, so I have a second customer. We have the two prefixes, the red and the blue one; they are accessing the service and it is under attack. Okay, what we do is simply add a router, which I called ABH. It's just a name. We have this router plugged into the transit router, and what we do is use the diverting technique of the vendor-specific solution: we announce the /32 of the server under attack, and we use URPF on the interface. We only have to announce the attacker's /24 to Null0 and then what happens? Well, only the bad traffic is dropped, because, as you can see, we redirect only the web server that is on the left to the ABH router. The rest of the traffic flows through the backbone.
So, we only divert the traffic, we only set URPF on one interface, and in order to mitigate we use static routes to Null0. So, pretty easy. Also, we only deal with the layer 3 level here; if you want to go higher, you can use ACLs or FlowSpec. You can see this solution as only a first line of defence.
Okay. Static routes... I told you I wanted something that is more flexible, easier to set, because I was harsh about the ACLs. So, let's make something more flexible.
Let's imagine you have the backbone, you have the ABH router, which can be a router that you get from your spares or a Linux box with high bandwidth network cards, and what we need is just to add something on the right, which is a BGP process we can interact with, which I call the injector. How do you set this up? You simply put the ABH router in the iBGP in order to get the full routing table, all your infrastructure and the Internet, and you connect the injector to the ABH router with a new BGP session, so it can get all the information and it will be able to communicate with the ABH router. Of course, we still need URPF on the interface. In case of incoming traffic, when we are under attack on the left server, how do we announce the /32 of the customer? We simply use the injector and announce the /32 with a specific community we decide; in my case I chose an arbitrary number, which is 1234, and then we have a route map on the ABH router that will re-announce this route with itself as next hop. Then we will divert the traffic for this /32 only. And when we want to mitigate the red traffic, we simply announce its prefix with another BGP community, which is 66666, and we set the next hop to another IP, and URPF will do the magic.
So, okay, we did mitigate the attack. I showed you two different techniques, and the injector, which is on the right of the drawing, the BGP process you instrument, gives you a lot of ways to feed it with information that will help you to mitigate traffic. The first one ‑‑ these are just examples of ideas I had ‑‑ would be to use the well-known IPs of exploitable services like DNS, NTP, etc., that are huge sources of DDoS amplification attacks; there are initiatives like the Open DNS and Open NTP projects that list all the IPs of the open services that could be used to create DDoSes. So you can create static files with all these /32s and you simply announce them from the injector to the ABH router. As long as you do not announce a service prefix from your ABH, you will not divert traffic, but if you are facing an attack you simply have to divert the traffic through the ABH router and you will mitigate it.
Also, depending on your business, you can use geographic communities your transit provider can attach to the routes they are announcing to you, NetFlow statistics, geographic IP data, and so on. You can use a lot of different ideas in order to create and customise your own mitigation solution.
I made a proof of concept of this, and of course there is something you must do, and it is mandatory, never forget it: set the no-advertise community on all the prefixes the ABH router receives from the injector, because it will be really messy in your backbone if you do not. You use a huge local preference in order to have preference for the routes you want to null-route. But this is not the most interesting part. The injector I created with ExaBGP, a project which I think most of you know; it's made with Python and quite flexible. So I created a Python script that talks to ExaBGP and can inject the routes: I learn the routes from the BGP session, I find the routes that match the criteria I decided for my proof of concept and I simply inject the routes to the ABH router, and it works.
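A minimal sketch of what such an injector process can look like when run as an ExaBGP API process: ExaBGP turns whatever the helper prints on standard output into BGP announcements. The prefixes, the next hops and the AS number 65000 are placeholders; the communities 1234 and 66666 are the arbitrary values used in the talk.

```python
#!/usr/bin/env python3
# Run as an ExaBGP API process: every line printed to stdout becomes a BGP update.
import sys
import time

# /32 of the attacked service: divert it through the ABH router (community 1234).
DIVERT = [("192.0.2.10/32", "65000:1234", "self")]
# Attacking prefix: point it at a next hop the ABH router resolves to Null0 (community 66666).
BLACKHOLE = [("198.51.100.0/24", "65000:66666", "203.0.113.1")]


def announce(prefix, community, next_hop):
    sys.stdout.write(
        f"announce route {prefix} next-hop {next_hop} community [{community}]\n"
    )
    sys.stdout.flush()  # ExaBGP reads the API line by line


for prefix, community, nh in DIVERT + BLACKHOLE:
    announce(prefix, community, nh)

while True:          # stay alive so the announcements are not withdrawn
    time.sleep(60)
```

On the ABH router side, a route map matching the first community re-announces the route with the ABH router as next hop, while URPF on the interface drops packets whose source resolves to Null0, exactly as described above.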
What I want you to see is that ABH is not a panacea. It's not a solution that copes with any DDoS as if by magic. No. It's a first line of defence, because it takes care of only the layer 3 packets.
So, what you can do is first do the filtering with the ABH, and then implement higher level defences like TCP SYN cookies, or FlowSpec, or, now that we have less traffic, maybe we can send it to the vendor-specific solution.
What I presented here is a cheap solution. You can use a spare router or a Linux server with high bandwidth network cards, which is not so costly, and guess what? It's a network-based solution, so you take back the power and you can be proactive with the creation of sets of criteria in order to do the mitigation of an attack. You know your business, you know the kind of traffic you are facing and which traffic is legitimate. So you can create some lists about that, and build clusters from the NetFlow information you can get.
Also, it's a scalable solution. You can use BGP equal cost multipath in order to multiply the number of ABH routers that could be used to do the mitigation. You can use only one injector in order to set the rules for all the ABH routers. And as I already said, it's a first line of defence. It's modular, it's not an all-in-one solution. You can add to and improve this solution with higher level solutions.
I talked about this to some friends who work for CDNs and ISPs, and they are actually looking at it. The most important part of their investigation regarding this solution is to put the ABH router into a VRF, which would be really good, because it would cost zero and it would be really, really preferable, so I'm looking at that. Of course, I am totally open to feedback.
And I just created a GitHub repository, which is called advanced blackholing. You can click the link in the PDF file in order to access it. Also, in the agency, some people know about that, we created an Observatory which is looking at the resilience of the BGP and DNS infrastructure in France. This is something we created with AFNIC, the people who take care of the .fr TLD, and we published a report yesterday which is in English, so you can download it; it's available as PDF, Kindle and EPUB.
Thank you for your attention, and if you have any questions I will answer them.
SPEAKER: From IP-Max. Nice presentation, but I see a problem there. I also have a video link; I will just attack your link, and if you have only got one 10 gig, I will send you more than 10 gig and then you cannot do anything. The concept is really cool. I like it and I will probably implement it, but ‑‑
FRANCOIS CONTAT: As I said at first, if the transit is totally full, this is not a solution, of course. Okay. That's the first thing.
Second, as I said, you can use BGP equal cost multipath in order to distribute the load over different ABH routers. So let's imagine you have only one transit link, okay, but it's a big one, and the good and the bad traffic together are under this capacity, so you have the capacity to handle the traffic. You have your transit router and you just multiply the number of ABH routers that you connect to it and you use BGP equal cost multipath in order to divide the traffic. Is that clearer?
SPEAKER: It's okay, but I still think I will saturate your link. But I like the thing. I will get back to you after that.
CHAIR: Any other questions?
Francois, thank you.
(Applause)
CHAIR: I have requested the venue staff not to serve you any food until you have rated the presentations. So, please go to the meeting website. Rate the presentation, then enjoy your lunch and we'll see you back here at three o'clock.
(Lunch break)
LIVE CAPTIONING BY
MARY McKEON, RMR, CRR, CBC
DUBLIN, IRELAND.