VMware VeloCloud Edge configuration quirks

Posted in Networking, SD-WAN, Uncategorized by douglash@gmail.com on January 17, 2020. Tagged sd-wan, velocloud, vmware. 3 comments.

I recently had an opportunity to experiment with a VeloCloud Edge ‘gateway’ before/while deploying it, and thought I would write up the ‘quirks and features’ that I ran across. Some background: I’m used to deploying solutions that provide many ‘nerd-knobs’ and the ability to look at logs and flows with relative ease. VeloCloud, from a customer’s standpoint, is very much a hands-off, let-it-do-its-thing solution. There are pros and cons to this approach, depending on what your needs are, and how far you are willing to trust the magical ~~black~~ white box to do what you want it to do. In my experience, VeloCloud certainly gets the job done, however sometimes I wish that I could have more visibility into what’s going on behind the curtain.

The first quirk is the limitation on how WAN interfaces can be configured. On the Edge appliances, only ‘routed’ interfaces can be used for WAN connections. Typically these routed ports are labeled ‘GE1’, ‘GE2’, ‘SFP1’, etc. The ‘LAN’ ports are not available for WAN use, as those are considered ‘Switched’ ports. What this means is that on a smaller appliance, such as the 520 model, you’re limited to 2 copper WAN ports out of the box. That is probably suitable for most smaller branches, but if you need more, you’ll need to install SFP modules to enable the other 2 ‘routed’ interfaces.

Supposedly there’s a way to configure 2 WAN connections on a single port, however it seems to be designed around an ISP connection that provides both an Internet WAN connection and a private WAN connection, such as MPLS or Metro Ethernet, over the same physical connection. One connection would be either untagged or tagged with a VLAN ID. The second connection would of course be tagged with a second VLAN ID. This is not an unusual configuration for SD-WAN and firewall deployments with multiple internet connections, especially when dealing with high-availability configurations or multi-gateway/SD-WAN environments that need to share the internet connections between multiple devices. Typically this would be done by connecting the internet connections into one or more switches, each one on a specific VLAN. I attempted to setup the Edge gateway with both internet connections on one physical interface (with one or both connections tagged with a VLAN ID), however I was unable to get both working reliably. I switched to the standard setup (one port per WAN connection), and installed a copper SFP module into the Edge so I could also have the use of a ‘Routed’ LAN interface. More on Routed LAN interfaces below.
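
The VLAN layout described above (one connection untagged or tagged, the other tagged with a different ID) can be sanity-checked on paper before touching the Edge. This is a minimal Python sketch, not VeloCloud's own configuration model: `WanOverlay` and `validate_shared_port` are hypothetical names made up for illustration, and the only hard fact it relies on is the 802.1Q VLAN ID range of 1 to 4094.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WanOverlay:
    """Hypothetical description of one WAN connection sharing a physical port."""
    name: str
    vlan_id: Optional[int]  # None means untagged


def validate_shared_port(overlays: List[WanOverlay]) -> List[str]:
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []

    # At most one connection on the port can be untagged.
    untagged = [o.name for o in overlays if o.vlan_id is None]
    if len(untagged) > 1:
        problems.append("more than one untagged connection: " + ", ".join(untagged))

    # Tagged connections need valid, distinct VLAN IDs.
    seen = {}
    for o in overlays:
        if o.vlan_id is None:
            continue
        if not 1 <= o.vlan_id <= 4094:
            problems.append(f"{o.name}: VLAN ID {o.vlan_id} is outside 1-4094")
        elif o.vlan_id in seen:
            problems.append(f"{o.name}: VLAN ID {o.vlan_id} already used by {seen[o.vlan_id]}")
        else:
            seen[o.vlan_id] = o.name
    return problems
```

For example, an untagged internet connection plus an MPLS connection on VLAN 200 passes the check, while two connections both tagged with VLAN 100 are reported as a conflict.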

On the topic of LAN interfaces, we have two types. The first type is the ‘Switched’ ports on the back of the Edge. You configure these ports in the same way as a basic Layer 2 switch: each port can be set up in ‘Access’ or ‘Trunk’ mode and carry any of the VLAN interfaces that you have configured. However, keep in mind that these VLAN interfaces cannot be used with OSPF to peer with other OSPF routers. OSPF can be enabled on a VLAN interface, but only in passive mode.

The second type of LAN interface is a ‘Routed’ interface. This cannot be shared with a WAN interface (no router-on-a-stick allowed here!), however there is more control available over the interface, including the ability to enable OSPF for peering with other routers. You CAN create a sub-interface on a ‘Routed’ interface, however OSPF cannot be enabled in any capacity on that sub-interface.
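
The OSPF rules from the last two paragraphs can be summarized as a small lookup table. This is a minimal Python sketch that encodes only the behavior described in this post (it is not an official VeloCloud support matrix, and behavior may differ between software versions); the enum and function names are made up for illustration.

```python
from enum import Enum


class LanInterface(Enum):
    SWITCHED_VLAN = "VLAN interface on switched ports"
    ROUTED = "routed interface"
    ROUTED_SUBINTERFACE = "sub-interface on a routed interface"


class OspfSupport(Enum):
    NONE = "OSPF cannot be enabled"
    PASSIVE_ONLY = "OSPF in passive mode only (no peering)"
    FULL = "OSPF with peering"


# Behavior as described in this post; an assumption for any other version.
OSPF_SUPPORT = {
    LanInterface.SWITCHED_VLAN: OspfSupport.PASSIVE_ONLY,
    LanInterface.ROUTED: OspfSupport.FULL,
    LanInterface.ROUTED_SUBINTERFACE: OspfSupport.NONE,
}


def can_peer_ospf(interface: LanInterface) -> bool:
    """True only if the interface type can form OSPF adjacencies."""
    return OSPF_SUPPORT[interface] is OspfSupport.FULL
```

In practice this means that any design that needs an OSPF neighbor on the LAN side has to reserve a full routed port for it.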

Side note: I believe BGP is not bound by these interface limitations, however I did not have a BGP router available at the time to see if that’s true (including VLAN interfaces). This would make setting up dynamic routing a bit more flexible in an environment that can utilize BGP.

Other quirks I’ve found:

Packet captures are allowed, but it takes a few minutes to start capturing on the specified physical interface, and you’re limited to only 120 seconds. This makes it tricky to coordinate reproducing an issue with a user, since the capture may be ready NOW, or in 2 minutes. And if it’s an intermittent issue…
Some changes require the VeloCloud services to restart. This may cause traffic to stop flowing temporarily.
Edge Gateways have a limited number of tunnels they can support. Don’t expect to be able to support concurrent branch-to-branch tunnels between ALL your branches if you’ve got more than a few dozen branches with dual internet connections at each site. You can do dynamic branch-to-branch tunnels, and traffic will dynamically ‘float’ to one of those branch-to-branch tunnels once it’s established, but be careful with this. Dynamic VPN works in simple environments where branch-to-branch communication is rare (such as the occasional VoIP call, etc). But if you don’t limit what traffic is allowed between branches, you could run into situations where some process behind the scenes (network vulnerability scans, Windows Update peer-to-peer sharing, roaming users that have ALL the printers installed locally on their laptop for ALL branches, etc) will trigger the creation of dynamic VPNs between ALL the branches. The Edges will run out of memory or CPU and start to drop traffic, OSPF will drop routes intermittently as the OSPF process competes for resources, etc. Make sure your appliances are SIZED CORRECTLY not only for the number of SITES, but also for the number of internet connections at each site. If someone gets the bright idea of slapping a 4G card in each Edge Gateway for a backup connection… you’ve now added N tunnels from each branch to your hubs.
You can’t ping the LAN interface IPs of the VeloCloud Edge Gateways… at least out of the box through the SD-WAN. This makes troubleshooting connectivity issues more exciting.
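
The sizing point about tunnels is easier to see with numbers. The Python sketch below assumes, purely for illustration, that one tunnel is built for every pairing of a local WAN link with a remote WAN link. That counting rule, the function names, and the example figures are all assumptions rather than documented VeloCloud behavior, so check the vendor's sizing guidance for the real limits.

```python
def hub_tunnels(branches: int, links_per_branch: int, hub_links: int) -> int:
    """Tunnels terminating on one hub, assuming one tunnel per WAN-link pair."""
    return branches * links_per_branch * hub_links


def full_mesh_tunnels_per_branch(branches: int, links_per_branch: int) -> int:
    """Branch-to-branch tunnels on one branch if it reaches every other branch at once."""
    return (branches - 1) * links_per_branch * links_per_branch


if __name__ == "__main__":
    # Example figures only: 40 branches and a hub with 2 WAN links.
    print(hub_tunnels(40, 2, 2))                # 160
    print(hub_tunnels(40, 3, 2))                # 240 after adding a 4G link per branch
    print(full_mesh_tunnels_per_branch(40, 2))  # 156
```

Under these assumptions, adding a third link at every branch grows the hub's tunnel count by half without adding a single new site, which is why link count matters as much as site count when sizing.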

In a future post I’ll do a comparison between VeloCloud and SilverPeak (since I have experience with both solutions) and how the two differ in deployment flexibility and manageability.

3 thoughts on “VMware VeloCloud Edge configuration quirks”

Vladimir Franca de Sousa says:

February 4, 2020 at 3:32 pm

Hello fellow networker 🙂

Interesting post, I would like to make a few remarks and ask a few questions around a few points you raised.

Note that though I am a VMware employee (you can read my profile on LinkedIn) working in the SD-WAN organisation, what I am writing is just my view/opinion and I have not cross-checked or validated it with engineering or any other VMware contact.

Anyway, in regards to this point:
VeloCloud, from a customer’s standpoint, is very much a hands-off, let-it-do-its-thing solution. T… however sometimes I wish that I could have more visibility into what’s going on behind the curtain.

The so-called black-box style, and the added fact that VMware SD-WAN has a very simple and to-the-point GUI where you do all your configuration, monitoring and troubleshooting (where you do have some good visibility on the needed points), actually makes the solution quite attractive to customers. It is very easy to learn and use, requiring really minimal effort for anyone to get on board with the solution and get deployments done in a matter of minutes after spending a few hours learning the product. We also have many out-of-the-box defaults and auto-discoveries that are quite helpful, in contrast to what some other vendors will give you (template based, almost CLI-like style). Being a NERD myself I understand where you are coming from, but we are looking for what is best for the actual end user of this technology :).

Btw, I used to do DC SDN work for another major network vendor (see my LinkedIn profile) and trust me, the ability to look under the hood didn’t make it any simpler to understand what was going on 🙂

In regards to the tests themselves, could you confirm the actual version of the VCE you tested?
I believe this is quite an important point when blogging about an SDN solution you tested.
For example, the remark below is no longer true: you do have the option to configure “ICMP Echo Response” on the routed/VLAN interfaces.
“You can’t ping the LAN interface IPs of the VeloCloud Edge Gateways… at least out of the box through the SD-WAN. This makes troubleshooting connectivity issues more exciting.”

Could you be a bit more specific in what you meant by “I attempted to setup the Edge gateway with both internet connections on one physical interface (with one or both connections tagged with a VLAN ID), however I was unable to get both working reliably.“?
I have done this implementation in many LABs/POCs/Pilots and have seen it done by other SEs 🙂 Also, we have officially tested 16 overlays out of a single interface 🙂

In regards to the more serious remark below, is this something you actually tested/observed:
“But if you don’t limit what traffic is allowed between branches, you could run into situations where some process behind the scenes (network vulnerability scans, Windows Update peer-to-peer sharing, roaming users that have ALL the printers installed locally on their laptop for ALL branches, etc) will trigger the creation of dynamic VPNs between ALL the branches. The Edges will run out of memory or CPU and start to drop traffic, OSPF will drop routes intermittently as the OSPF process competes for resources, etc.”

It seems to imply that the creation of the dynamic tunnels would result in the edge running out of resources.
I ask this because the VMware SD-WAN Edges (and by the way, they are not called gateways) have built-in protection that limits how many dynamic tunnels they are allowed to build. Actually this makes it easier to scale the solution, as it does not require a full mesh and can work in a hub-and-spoke manner, which allows smaller branches to use more cost-effective HW. Some solutions out there actually struggle to scale as they can only support full mesh.

I did read the remarks from another blogger around this topic, with some more harsh words, and though I won’t claim to know what the old DMVPN issue was… I would expect (and have seen) that any solution, whether purely ASIC-based or SW-based, will have limitations on how many processes it can run at the same time. So sizing is part of any design, imho.

Also curious about the remark “however I did not have a BGP router available at the time”: I am wondering what prevented you from spinning up a virtual router with such capabilities. That’s actually something I love about SDN and SW-based solutions… I don’t have to depend on hardware to run any of my tests.

I am relatively new to the VMware SD-WAN world, which is now over 7 years old (counting VeloCloud times), so excuse me if I got something wrong 🙂

Cheers,
Vladimir

1. douglash@gmail.com says:
  
  February 12, 2020 at 9:10 am
  
  Vladimir,
  Thanks for the reply!
  I agree, VeloCloud is very ‘plug-and-play’… almost no fine-tuning needed as the default policies are pretty effective. However I’m used to other SD-WAN solutions (SilverPeak in particular) that come devoid of any predefined policies and leave all of the configuration up to the implementer. It’s not very difficult, but very different from VeloCloud. VeloCloud excels for customers who need a simple solution to ‘Just Work’. SilverPeak excels for customers who have unusual network designs, unique requirements, or need full WAN Acceleration (global presence, satellite, etc.).
  
  Also I consider the end-user of SD-WAN not the actual users of the service, but the people that are managing the system day-to-day. Users are just happy that it works, and both solutions fulfill that need. It’s when things get weird and stop working as expected (or things are working just fine but there’s an issue with one of the WAN links) that the ability to see exactly which path the traffic is flowing down, what the real-time performance of each underlay tunnel is, etc. immediately, without needing to call TAC, is very useful. However this level of troubleshooting generally requires a network engineer to understand what’s going on… for customers that don’t have that expertise in-house, the recommendation would be to use something that works well off-the-shelf (VeloCloud), or utilize a service provider to be that expert.
  
  The environment I was ‘testing’ against (taking a bit of time during deployment, not really doing a proper test in a lab environment) runs ‘3.3.2’, which I believe is fairly current… however I don’t have any control over the version as it’s managed via a service provider.
  I just tested the ‘pinging’ of the LAN-side of the VeloCloud from a remote site:
  Version 3.3.2: pinging an interface (routed) from another site was successful
  Version 3.2.1: pinging an interface (VLAN) from another site was not successful
  Version 3.1.2: pinging an interface (VLAN or routed) from another site was not successful
  
  The checkbox ‘ICMP Echo Response’ is checked in all three versions. Perhaps this was a bug that has been fixed in 3.3.2?
  
  On the ‘multiple WAN overlays on a single interface’ item… to be fair I didn’t spend a ton of time trying to get that to work, but it does seem needlessly complicated to set up any WAN that’s not using a raw routed interface. Again, I’m comparing to other solutions where I can simply create a new sub-interface on ANY available physical port, declare that it’s a WAN connection, and slap the needed VLAN tags and IP info on it. Not sure why VeloCloud makes it more difficult than it should be… I have to use a routed interface, I can’t share a routed port easily/intuitively, and I can’t convert the multitude of switched ports on the device into routed ports to use for WAN connections (at least on the Edge 520; I understand other larger units can possibly do so).
  
  On the dynamic tunnels between sites, I may be incorrectly attributing the issues we had to branch-to-branch limitations… We did see a high amount of inter-branch traffic that seemed to cause the hub to run out of resources and start dropping OSPF. The root cause of the performance issue was the undersized Edge at the hub site, undersized not for the amount of traffic it could handle, but for the number of tunnels it had to support. We temporarily addressed the issue by enabling branch-to-branch tunnels, which worked for a bit, but ultimately the hub site was upgraded to resolve the issue. I believe that the hub was unable to handle the amount of traffic (probably more the number of connections being established than the bandwidth consumed), and even with branch-to-branch tunnels enabled, the hub continued exhibiting performance degradation. The moral of the story: size the Edges appropriately, keeping in mind all the limitations each Edge has.
  
  On the missing BGP router… since this wasn’t in an actual lab environment, I didn’t have the ability to get it connected up to a network that could handle BGP at the time. Ideally I would have access to virtual Edges to actually test in a true lab environment and work out the kinks before installing into production… but not everyone has a production environment that’s separate from their test. 🙂
  
  I am curious how Engineers who want to test/train on VeloCloud gain access to the orchestrator/edges to be effective in testing… ideally something like VMUG Advantage would provide that.
  
  Thanks for the response, I really enjoy getting feedback, especially on a new/fresh blog like this. 🙂
  
Vladimir F. de Sousa says:

February 13, 2020 at 2:59 pm

Hi Douglas,

me again 🙂 Last words…

I still believe VMware SD-WAN is the best fit for customers even in complex designs. We have integrated it into many networks with complex setups where multiple paths were available, for example providing high availability with alternative paths via the underlay in case of failures.

In regards to multiple overlays out of 1 physical interface: indeed, as long as it’s a route-capable interface, VMware SD-WAN can run multiple overlays. The particular model you mentioned (5x0) comes with a built-in Marvell switch, which is an L2 switch meant to connect the LAN side; the ports are actually called LAN 1, 2, etc 🙂

In regards to the sizing issues, I understand now that the issue was on the HUB side, which indeed must be properly sized, as it won’t stop accepting static tunnels from VCE branches; it will alert you, though, that it has exceeded its max supported tunnels. And we do have the ability to monitor CPU/memory/tunnel/flow usage on the VCEs.

You can easily build our VCEs on “testing” platforms like EVE-NG (that’s my main LAB btw) 🙂 and spin up virtual routers (VyOS) and have a lot of fun learning VMware SD-WAN 🙂

Closing out… and again, congratulations on passing your network certification 🙂

Cheers,
Vladimir
