Moving Cisco UCS to 10G Interfaces

We initially implemented our Cisco UCS chassis and FIs with 4 port channels, each with 2x1Gb interfaces, connected to our Cisco core switches. We are now in the process of moving from a dual Cisco core to a Juniper Virtual Chassis core; more on that in another post. Part of getting the new core was finally getting 10G for our network. We had been surviving just fine with our current network connectivity, but figured it wouldn’t hurt to get 10G and connect whatever we could to it.

What I could not find searching around the internet was how the UCS FIs were going to handle the additional links and how the traffic would move over. I was afraid they would do some sort of Spanning Tree blocking and not allow the new links to pass traffic. However, after checking the existing links I realized that I had two from each FI, and both were actively passing traffic with neither in a blocking state.

I then went ahead and started to plan the turn-up of the links. The majority of our servers sit in the UCS environment, from bare metal Linux and Windows machines to our 500-guest VM farm. With so much crucial infrastructure we wanted to make sure we didn’t have any downtime or lose any traffic during the transition. So as part of the planning I built a Python script that pings a list of known addresses to confirm they are all up and on the network as each part of the plan is completed. I wanted it to be reusable across platforms, so I made allowances for the different ping options each OS requires. The only requirement is a file called hosts.txt, in the same directory as the script, containing the IP addresses you want to ping, one per line. It’s multithreaded, so it runs the pings in parallel and completes as quickly as possible. Then you just need to go through the output and look for anything that is failing.

 

 

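#!/usr/bin/env python
# Ping every address listed in hosts.txt (one per line) in parallel.
from __future__ import print_function
import os
import platform
import subprocess
import sys
import threading

plat = platform.system()
scriptDir = sys.path[0]
hosts = os.path.join(scriptDir, 'hosts.txt')

# Read the host list, dropping blank lines and trailing newlines.
with open(hosts, "r") as hostsFile:
    lines = [line.strip() for line in hostsFile if line.strip()]

# One echo request per host; the ping flags differ per OS.
if plat == "Windows":
    pingArgs = ["ping", "-n", "1", "-l", "1", "-w", "100"]
elif plat == "Linux":
    pingArgs = ["ping", "-c", "1", "-l", "1", "-s", "1", "-W", "1"]
elif plat == "Darwin":
    # macOS ping takes -W in milliseconds, not seconds.
    pingArgs = ["ping", "-c", "1", "-s", "1", "-W", "1000"]
else:
    sys.exit("Unsupported platform: " + plat)

printLock = threading.Lock()

def ping(ip):
    proc = subprocess.Popen(
        pingArgs + [ip],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True
    )
    out, error = proc.communicate()
    # Print each host's output as one unit so threads don't interleave.
    with printLock:
        print(out)
        print(error)

# start() runs each ping in its own thread; run() would execute them one at a time.
threads = [threading.Thread(target=ping, args=(ip,)) for ip in lines]
for t in threads:
    t.start()
for t in threads:
    t.join()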
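
Scrolling through raw ping output for a few hundred hosts gets tedious. Here is a minimal sketch of an alternative, assuming ping's exit code is a good enough signal (0 for a reply, non-zero otherwise, which holds on Linux and macOS but is less reliable on Windows). It collects the exit codes and returns only the hosts that failed. The names check_hosts and ping_exit_code are illustrative and are not part of the script above.

```python
import platform
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ping_exit_code(ip):
    """Send one echo request and return ping's exit code (0 means a reply came back)."""
    # -n is the count flag on Windows, -c on Linux and macOS.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    proc = subprocess.Popen(
        ["ping", count_flag, "1", ip],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    proc.communicate()
    return proc.returncode


def check_hosts(hosts, runner=ping_exit_code, workers=32):
    """Run `runner` against every host in parallel and return the hosts that failed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(runner, hosts))
    return [host for host, code in zip(hosts, codes) if code != 0]
```

Calling check_hosts with the list read from hosts.txt gives back just the addresses that need a closer look. The runner argument is there so the logic can be exercised without sending any real pings.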

For the migration we took the subordinate FI and brought up the 10G interface. As we watched traffic flow over it, we ran the ping script a couple of times. We then started to shut down the trunk interfaces, running the ping script each time and looking for issues. After we had the subordinate FI moved over to the 10G and the new core, we did the same with the primary FI. I was happy to find that at no time did we lose any connectivity to our hosts/guests and that everything went smoothly.

3750 Unicast Flooding Issue

Since I ran into this issue and wasn’t really able to find anyone posting about it, I thought I should put something together for anyone else who runs into it. I had an issue with a stack of 3750X switches where there was unicast flooding to all of the ports in the same VLAN. While doing research I came across suggestions of asymmetric L2 routes, timeout values for the ARP tables, and TCAM table overruns. My issue turned out to be none of these. The ARP timeout values were all increased and that didn’t solve the problem. My network is fairly simple with a collapsed core, so L2 asymmetric routing wasn’t the issue. The TCAM tables were definitely not being overrun on these switches, as they can handle 8K ARP entries and I am nowhere near that.

So what did that leave me with? An issue where the MAC address tables of all members of the stack were not getting updated in a timely manner, as seen in the output of the following command:

remote command all sh mac add count | i Total

Switch : 3 : (Master)
---------------------
Total Mac Addresses    : 152
Total Mac Addresses    : 585
Total Mac Addresses    : 39
Total Mac Addresses    : 381
Total Mac Addresses    : 384
Total Mac Addresses    : 22
Total Mac Addresses    : 28
Total Mac Addresses    : 178
Total Mac Addresses    : 0
Total Mac Address Space Available: 6402

Switch : 1 :
------------
Total Mac Addresses    : 152
Total Mac Addresses    : 585
Total Mac Addresses    : 162
Total Mac Addresses    : 22
Total Mac Addresses    : 39
Total Mac Addresses    : 381
Total Mac Addresses    : 384
Total Mac Addresses    : 28
Total Mac Addresses    : 0
Total Mac Address Space Available: 6418

Switch : 2 :
------------
Total Mac Addresses    : 152
Total Mac Addresses    : 585
Total Mac Addresses    : 165
Total Mac Addresses    : 22
Total Mac Addresses    : 39
Total Mac Addresses    : 381
Total Mac Addresses    : 384
Total Mac Addresses    : 28
Total Mac Addresses    : 0
Total Mac Address Space Available: 6415

Switch : 4 :
------------
Total Mac Addresses    : 152
Total Mac Addresses    : 585
Total Mac Addresses    : 39
Total Mac Addresses    : 381
Total Mac Addresses    : 384
Total Mac Addresses    : 22
Total Mac Addresses    : 28
Total Mac Addresses    : 140
Total Mac Addresses    : 0
Total Mac Address Space Available: 6440

After many hours of troubleshooting with TAC, they finally came to the conclusion that we were hitting bug:

CSCut64281    Ports on Member of the stack takes long time to learn/age MAC addr

This was only evident in the 15.1x code train. The issue didn’t exist in 15.0, which is why some of my older switches weren’t seeing it, only the brand new shiny ones I had installed last year. The fix finally became available in the last couple of months in 15.2(2)E3. I finished testing the release on some slightly non-prod switches and then decided to roll it out to my campus. Now I am seeing the following on the upgraded switches:

Switch : 3 : (Master)
---------------------
Total Mac Addresses    : 142
Total Mac Addresses    : 536
Total Mac Addresses    : 44
Total Mac Addresses    : 70
Total Mac Addresses    : 30
Total Mac Addresses    : 21
Total Mac Addresses    : 27
Total Mac Addresses    : 150
Total Mac Addresses    : 0
Total Mac Address Space Available: 7151

Switch : 1 :
------------
Total Mac Addresses    : 142
Total Mac Addresses    : 535
Total Mac Addresses    : 44
Total Mac Addresses    : 70
Total Mac Addresses    : 30
Total Mac Addresses    : 21
Total Mac Addresses    : 27
Total Mac Addresses    : 151
Total Mac Addresses    : 0
Total Mac Address Space Available: 7151

Switch : 2 :
------------
Total Mac Addresses    : 142
Total Mac Addresses    : 534
Total Mac Addresses    : 44
Total Mac Addresses    : 70
Total Mac Addresses    : 30
Total Mac Addresses    : 20
Total Mac Addresses    : 27
Total Mac Addresses    : 150
Total Mac Addresses    : 0
Total Mac Address Space Available: 7154

Switch : 4 :
------------
Total Mac Addresses    : 142
Total Mac Addresses    : 535
Total Mac Addresses    : 44
Total Mac Addresses    : 70
Total Mac Addresses    : 30
Total Mac Addresses    : 21
Total Mac Addresses    : 27
Total Mac Addresses    : 146
Total Mac Addresses    : 0
Total Mac Address Space Available: 7156

While not perfect it definitely seems to be a lot better than the previous reports. I keep looking for the bug to be posted on Cisco’s site, but it is still private at this point.
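
One rough way to put a number on "a lot better" is to compare the “Total Mac Address Space Available” value across stack members. The sketch below is my own illustration, not something TAC provided, and the function names are made up. It parses the command output and reports the gap between the highest and lowest values. On the two outputs above the gap is 38 before the upgrade (6440 vs 6402) and 5 after (7156 vs 7151).

```python
import re


def available_by_switch(output):
    """Map stack member number -> 'Total Mac Address Space Available' value."""
    result = {}
    current = None
    for line in output.splitlines():
        header = re.match(r"\s*Switch\s*:\s*(\d+)\s*:", line)
        if header:
            current = int(header.group(1))
            continue
        space = re.match(r"\s*Total Mac Address Space Available:\s*(\d+)", line)
        if space and current is not None:
            result[current] = int(space.group(1))
    return result


def spread(output):
    """Difference between the largest and smallest available-space values in the stack."""
    values = list(available_by_switch(output).values())
    return max(values) - min(values)
```

A small gap does not prove the tables are identical, since the per-VLAN counts can still differ, but a large one is a quick hint that the members disagree.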

Troubleshooting Websense as Proxy for site access

I recently had to troubleshoot a problem with a client going through Websense as a proxy and trying to gain access to a site. The site was at https://somesite.com:11001. Every time I went to the site I would just get a “Page could not be displayed” error. I started troubleshooting from the Websense side and couldn’t see anything in the interface itself, so I went to the log server, stopped the logging service, and ran it from the command line filtered to just the client I was testing with. However, this didn’t even show a hit from the client. I then had to go to the next level and troubleshoot with a packet capture and Wireshark. Once I captured the traffic I could see that Websense was returning an error that the browser wouldn’t display. The issue came down to using HTTPS on port 11001, which wasn’t allowed in the Content Gateway on the Websense appliance. Once I added that port I was able to browse successfully to the site and have it show up in the log server.

So below I have summarized the steps for someone else needing to do this type of troubleshooting.

How to use the Websense testlogserver to troubleshoot problems and limit the information that is seen:

  1. Log into the logging server
  2. Stop the “Websense Log Server” service
  3. Go into the c:\program files (x86)\Websense\Web Security\bin folder and run the testlogserver.exe -onlyip (ip address you want to see)
  4. You can now surf the site from that machine and see what errors are showing up in the log server to help determine the problem.
  5. If you need to go to another level then run a packet capture from the machine using Websense as an explicit proxy in your browser. You can then limit the capture to just the Websense IP.
  6. Once you have gone to the site you can then look at the packet capture and search for “http contains (site you are going to)”.
  7. You should be able to then decode the http stream and see all of the headers and information returned. This should help you in troubleshooting the issue.
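
Since the browser hid the proxy's error in my case, another way to see it is to send the proxy the same CONNECT request a browser would send for an HTTPS site and print the first line of the reply. This is only a sketch using a raw socket. The function names are made up, and the proxy address and port you pass in are whatever your explicit proxy setting is.

```python
import socket


def build_connect_request(host, port):
    """Return the raw CONNECT request a browser sends to an explicit proxy for HTTPS."""
    target = "{0}:{1}".format(host, port)
    return "CONNECT {0} HTTP/1.1\r\nHost: {0}\r\n\r\n".format(target).encode("ascii")


def status_line(response):
    """Extract the first line of the proxy's reply from the raw response bytes."""
    return response.split(b"\r\n", 1)[0].decode("ascii", "replace")


def probe(proxy_host, proxy_port, host, port, timeout=5):
    """Ask the proxy to open a tunnel to host:port and return its status line."""
    sock = socket.create_connection((proxy_host, proxy_port), timeout=timeout)
    try:
        sock.sendall(build_connect_request(host, port))
        return status_line(sock.recv(4096))
    finally:
        sock.close()
```

For the site in this post the call would look like probe(proxy_ip, proxy_port, "somesite.com", 11001). A 200 status means the proxy opened the tunnel; anything else is the error the browser was not showing.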

Setting up Ansible on OS X for deploying VPN Configurations

While my day job is as a Network Engineer, I also moonlight as a Network Engineer and consult for a couple of people. Most of that work involves VPNs for remote office connectivity. I’ve been able to get most of my clients to standardize their VPN devices on Cisco 5505s. While the configurations are a bit different, most of the items are the same, so this lends itself to a templating system. I’ve also found that when a client makes a frantic call saying that a device has died, they find it reassuring when I can immediately dial in, upload a configuration, and have them back up and running quickly. I went through Kirk Byers’ Python course and then got his emails about setting up Ansible. I thought that this might be a great way to build a standard config per client and keep their configurations in a template file so that I can regenerate them as needed. I am also running this on my MacBook Pro so that I have it wherever I am.

Kirk put together some wonderful walkthroughs of setting up Ansible and getting it to work:

https://pynet.twb-tech.com/blog/ansible/ansible-cfg-template.html
https://pynet.twb-tech.com/blog/ansible/ansible-cfg-template-p2.html
https://pynet.twb-tech.com/blog/ansible/ansible-cfg-template-p3.html

Since I wanted to use a password for the SSH connection, I did have to install sshpass for OS X:

http://thornelabs.net/2014/02/09/ansible-os-x-mavericks-you-must-install-the-sshpass-program.html

I took my standard config and put it into the templates directory. I then modified it to handle removing DHCP and turned key aspects of the config into variables, so that I could put those into vars/main.yml and customize them per site.
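
To show the idea without an Ansible install, here is a minimal sketch that uses Python's built-in string.Template as a stand-in for the Jinja2 templates Ansible actually uses. The template fragment and the variable names are made up for illustration and are not my real config.

```python
from string import Template

# Illustrative fragment only; each $name is filled in from the per-site variables.
ASA_TEMPLATE = Template(
    "hostname $hostname\n"
    "interface Vlan1\n"
    " ip address $inside_ip $inside_mask\n"
    "tunnel-group $peer_ip type ipsec-l2l\n"
    "tunnel-group $peer_ip ipsec-attributes\n"
    " pre-shared-key $psk\n"
)


def render_site(site_vars):
    """Fill the template; substitute() raises KeyError if a variable is missing."""
    return ASA_TEMPLATE.substitute(site_vars)
```

The per-site dictionary plays the role of vars/main.yml: one small set of values per client, one shared template. The strict substitute() call is deliberate, since a config with a silently missing value is worse than an error.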

The issue I ran into was that when I tried to run the playbook, Ansible didn’t like the PSK for the L2L VPN. I found that it needed to be enclosed in single quotes in the vars/main.yml file. This way there wasn’t an issue with the special characters in the PSK and the config would generate correctly.

After this it is a simple copy and paste and the VPN device is ready to go. The other thing this helps with is upgrades from older devices such as 501s. I can now just modify the template and get the new config for the 5505 with very minimal error checking and a much better standardized config.

Mac, Python, paramiko, all in a day’s work

I am trying to learn Python as I think it will be good for my day job. I bought a couple of books, but I am someone who learns by doing. I found some good scripts on the internet that I wanted to modify and make use of. However, I am also a Mac user, so I wanted to be able to run these scripts on my Mac from wherever I might be. I do on occasion travel to sites and do some extracurricular activities that might require this ability. The Mac has Python preinstalled, version 2.7.5, which seemed sufficient for what I wanted to do. The script I wanted to play with needed the paramiko module. I was able to download and extract it from here:

https://github.com/paramiko/paramiko

That was easy; however, the install instructions said it would be best to have setuptools. So I found this site:

https://pypi.python.org/pypi/setuptools#unix-including-mac-os-x-curl

And was able to find a command to download and install setuptools.
***Make sure you are root, you will have a much better time of it.***

curl https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -o - | python

So that installed correctly, however when I went into python and did an “import paramiko” I was told I needed a crypto module. I then went out and found this:

https://pypi.python.org/pypi/pycrypto

Downloaded it and of course I couldn’t use setuptools for it; it needed to be built and then installed. That required me to get Xcode 5.1 for the cc compiler and load it on my machine, which was straightforward enough. After the Xcode install I ran:

python setup.py build

But I was getting this error:

error: command 'cc' failed with exit status 1

Turns out there is an issue with Python and Xcode 5.1. The fix for that is to run the following before doing the build and install:

export CFLAGS=-Qunused-arguments
export CPPFLAGS=-Qunused-arguments

Once that is done you can then go into the pycrypto folder and run:

python setup.py build
python setup.py install

Now you have everything you need to use paramiko to SSH into a Cisco device from a Mac and run some commands or do whatever it is you want.

I did find one other thing that is needed: as part of the connect call for paramiko, I needed to specify “allow_agent=False, look_for_keys=False”. If I didn’t, I was getting password errors on the Cisco switch I was testing with.

ssh.connect('x.x.x.x', username='name', password='password', allow_agent=False, look_for_keys=False)
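
Here is a rough sketch of how that connect call could be wrapped to run one command against a list of devices. The helper name run_on_hosts and the client_factory argument are my own invention. The factory is only there so the loop can be tried with a fake client when paramiko or a real switch isn't handy.

```python
def run_on_hosts(hosts, username, password, command, client_factory=None):
    """Run one command on each host over SSH and return {host: output}."""
    if client_factory is None:
        import paramiko  # third-party; imported lazily so a fake client can be used instead

        def client_factory():
            client = paramiko.SSHClient()
            # Accept unknown host keys automatically; convenient in a lab, risky elsewhere.
            client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            return client

    results = {}
    for host in hosts:
        client = client_factory()
        try:
            # Same arguments as the connect call above, so key lookups never get in the way.
            client.connect(host, username=username, password=password,
                           allow_agent=False, look_for_keys=False)
            _stdin, stdout, _stderr = client.exec_command(command)
            results[host] = stdout.read().decode("utf-8", "replace")
        finally:
            client.close()
    return results
```

Something like run_on_hosts(["x.x.x.x"], "name", "password", "show version") returns a dictionary keyed by host, which is easy to print or diff. One failed connection stops the whole loop here; catching the exception per host would be the next thing to add.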

All in all it was a very educational day and I think some hours well spent. I am now going to take my scripts and look to put everything into variables and also specify some lists so I can run it against multiple machines.

Using McAfee Enterprise Security Manager to monitor AnyConnect Groups

We use AnyConnect for our remote access solution, and one of the issues I have is with other admins not putting people into the right groups, or not putting them into groups at all. What happens then is they get stuck in the DfltGrpPolicy, which is definitely not where I want them since it doesn’t have the customization for each of the different groups.

Since McAfee Enterprise Security Manager is monitoring my ASA and all of the logs are going to it, building an alert to notify me when people are in this group shouldn’t be an issue. Here is what we are going to do:

Identify the group, in this case I want to know when people get assigned the DfltGrpPolicy policy.

Now what we are going to do is build an alarm that fires when it sees DfltGrpPolicy in the object field. Since no one should be pulling this group for any reason, anyone who is in it is misconfigured and needs to be moved to the appropriate group.

1. Go to System Properties and then click on Alarms

2. Choose an assignee and give the alarm a name.

3. Choose Type: Field Match, Field: Object, Value: DfltGrpPolicy, and select your Cisco ASA as the device.

4. Choose your actions, in my case I am going to have it email when it sees that.  I am also going to customize the template so it only sends me the relevant information, such as user misconfigured and group they are in.

5. I leave Escalation blank

6. Click Finish and you are done.

Now sit back and admire your work; you should now be alerted when people are misconfigured.

Monitoring Domain Admin Password Changes with Nitro/McAfee Enterprise Security

We had a requirement to monitor when Domain Admins changed their password so that we could pass our PCI Audit for this year.  Mostly the issue came out of the fact that a hacker could just as easily change a domain admin password as they could create a new account in domain admins.  I was already monitoring domain admins for new accounts getting added/removed, this just seemed like the next step.

I came up with this method to be able to monitor our Domain Controllers with Nitro and then send me a notification/report.

1. Create a Watchlist
Call it whatever you want, but make it a destination user type and then choose an assignee.

2. Input all of your Domain Admins into the Watchlist

3. After creating the watchlist, then go into Alarms and create a new alarm.
Give it a name
choose an assignee

4. Go to the Condition tab and choose “specified event rate”, set the event count to “1”, and choose the time frame as 10 minutes.  This is needed to trigger the alarm when you need it to fire.

5. Then click on the filter icon. For your query filter choose the following:
signature id: 43-211006270,43-211006280,43-263047230
destination user: WL:(name of watch lists)
Under select a device, choose all of your domain controllers.  We have two domains so we are monitoring several machines.

6. Now here is where the magic comes in and the best alerting you can setup.  Under actions, choose Generate Report and then click configure
Here you are going to create a new report that lists the userid field and any other information you want.  You will also choose your email recipients here so that you can be notified when a domain admins password has changed.  Since the parameters from the alarm will not be passed to the report you will need to choose a few things for the filter at the bottom to keep it to what you are looking for.
signature id: 43-211006270,43-211006280,43-263047230
destination user: WL:(name of watch lists)
set the time range to the last 10 minutes

7. Click Finish. You now have an alarm that will be triggered when a domain admin changes their password.

Turn it off when you are done with it

I learned a lesson that I thought I knew and thought I actually exercised.

“Turn off whatever it is if you aren’t using it.”

Like most, I want to know what is going on with my network at all times. I especially want to know if someone is making a change to a key piece of infrastructure, not to mention it’s nice to show to the auditors when they ask. I have an alert set up on our Nitro appliance that notifies me when someone makes a change to our firewalls. This is important since the firewall is what I use for segregation on my network and to keep my credit card data safe. Nitro does this by monitoring the syslog coming out of the firewall and looking for a particular message which maps to a signature ID. When it sees that signature ID it sends a message to me and the other firewall admin so that we can look at each other and say yup, we made that change.

I was noticing over the last couple of weeks, though, that when I made a change I wasn’t always getting notified. I would get some alerts but not others. So I started by looking at the Nitro appliance to see if it was having a problem. As I was debugging it I noticed that it wasn’t receiving all of the messages it needed to alert on. Information was missing and I needed to find it.

I then started looking at my syslog config on my ASA, here it is for reference:

logging enable
logging console informational
logging buffered debugging
logging trap informational
logging host inside x.x.x.x
no logging message 106015
no logging message 313001
no logging message 313008
no logging message 419002
no logging message 106023
no logging message 710003
no logging message 106100
no logging message 302014
no logging message 609002
no logging message 609001
no logging message 302018

Nothing out of the ordinary, or so I thought. As I was using Google to look at stuff I came across some messages about the logging queue limit, which by default is 512. I decided to look at that and see if it could be causing my issue. When I looked I saw this:

sh logging queue

Logging Queue length limit : 512 msg(s)
-1123742434 msg(s) discarded due to queue overflow
0 msg(s) discarded due to memory allocation failure
Current 512 msg on queue, 512 msgs most on queue

Definitely not good: I am dropping messages and the queue is full. I thought, well, maybe I can increase the queue and everything will be fine after that. So I did the following:

logging queue 1024
sh logging queue

Logging Queue length limit : 1024 msg(s)
-1123731334 msg(s) discarded due to queue overflow
0 msg(s) discarded due to memory allocation failure
Current 1024 msg on queue, 1024 msgs most on queue

Immediately the queue jumped up and I was still dropping messages.

I thought, okay, maybe I need to prune the messages I am logging. These are minor messages and I hadn’t needed this data in any of my investigations, so I decided to kill them.

no logging message 302017
no logging message 302016
no logging message 302021
no logging message 302020

sh logging queue

Current 1024 msg on queue, 1024 msgs most on queue

No change and I am still dropping messages. I then thought, heck, maybe I just need a bigger queue; that always solves the problem, right? Bigger is better:

logging queue 4196
sh logging queue

Logging Queue length limit : 4196 msg(s)
-1123742651 msg(s) discarded due to queue overflow
0 msg(s) discarded due to memory allocation failure
Current 4196 msg on queue, 4196 msgs most on queue

That didn’t help out so much.  Immediately I am at a full queue and still dropping messages.  Then I looked back through my config again:

sh run logging

logging enable
logging console informational
logging buffered debugging
logging trap informational
logging queue 4196
logging host inside  x.x.x.x
no logging message 106015
no logging message 313001
no logging message 313008
no logging message 419002
no logging message 106023
no logging message 710003
no logging message 106100
no logging message 302014
no logging message 609002
no logging message 609001
no logging message 302018
no logging message 302017
no logging message 302016
no logging message 302021
no logging message 302020

Hmm, I wonder if logging to the console and buffer is causing my issue. I am not using them currently, and the last time I was troubleshooting I did turn them on. Could I really have not cleaned up after myself, and could this be causing me an issue? I then did the following:

no logging console
no logging buffered
sh logging queue

Logging Queue length limit : 4196 msg(s)
-1123728024 msg(s) discarded due to queue overflow
0 msg(s) discarded due to memory allocation failure
Current 0 msg on queue, 4196 msgs most on queue

Immediately the queue dropped down and there was nothing in it.  I then moved the queue back down to a smaller number.

logging queue 1024
sh logging queue

Logging Queue length limit : 1024 msg(s)
-1123728024 msg(s) discarded due to queue overflow
0 msg(s) discarded due to memory allocation failure
Current 0 msg on queue, 4196 msgs most on queue

No messages on queue and no dropped messages.  Also all of my test alerts are now working correctly and everything seems to be fine.
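
If you want to keep an eye on this from a script, the “sh logging queue” output is easy to parse. The sketch below is an illustration with made-up names. It also assumes that the negative discard count is a signed 32-bit counter that has wrapped around; I have not confirmed that with Cisco. Under that assumption the first output above works out to 3171224862 discarded messages.

```python
import re


def parse_logging_queue(output):
    """Pull the limit, discard counter, and current depth out of `sh logging queue`."""
    limit = int(re.search(r"Logging Queue length limit : (\d+)", output).group(1))
    discarded = int(
        re.search(r"(-?\d+) msg\(s\) discarded due to queue overflow", output).group(1)
    )
    current = int(re.search(r"Current (\d+) msg on queue", output).group(1))
    if discarded < 0:
        # Assumption: the counter is a wrapped signed 32-bit value.
        discarded += 2 ** 32
    return {
        "limit": limit,
        "discarded": discarded,
        "current": current,
        "full": current >= limit,
    }
```

The "full" flag is the useful part for alerting. A queue sitting at its limit is the symptom I should have been watching for long before the alerts started going missing.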

Lesson relearned: when you turn something on, make sure you turn it off. Even if at the time you don’t think it will cause you an issue, it may come back and cause you one later.

Dynamic ARP Inspection/DHCP Snooping

In trying to prevent man-in-the-middle attacks on my network I started looking at Dynamic ARP Inspection (DAI) and DHCP Snooping. I brought them home to my lab and started to play. Here is what I read to figure out what to do and how to implement it:

Research:
http://www.cisco.com/en/US/docs/switches/lan/catalyst3750x_3560x/software/release/12.2_58_se/configuration/guide/swdynarp.html#wp1039773
http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/dynarp.html
http://packetlife.net/blog/2010/aug/18/dhcp-snooping-and-dynamic-arp-inspection/

How to configure:

First turn on DHCP SNOOPING on the switch or switches:

“ip dhcp snooping”
“ip dhcp snooping vlan (vlan to monitor)”

Let it run for a while and populate the DHCP snooping binding database. This database is extremely important, as only the bindings in it will be allowed to ARP on the network.

To view the database you can use the following:

“show ip dhcp snooping binding”

Uplink ports and ports that connect to DHCP servers need the following on the interface, or clients won’t be able to get addresses:

“ip dhcp snooping trust”

Before you turn on Dynamic ARP Inspection you need to track down any dumb switches (switches that don’t support DAI) and any hosts with a static IP address.

I recommend removing any dumb switches from the network as they just create security holes and will cause you nothing but problems.

Here is one solution, but I think it is better to get rid of them and easier to deal with:

http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/dynarp.html#wp1077449

As for the static hosts, you can use “ip arp inspection trust”; however, I think a better method is to create a static IP address to MAC address binding with the following:

“ip source binding (mac add) vlan (vlan to monitor) (ip address) interface (interface of host)”

This way someone can’t just remove the static host and take over its IP address. Another option would be to change all hosts at the Distribution/Access layer to DHCP and put static bindings into the DHCP server for them to ensure their IP addresses don’t change.

To turn on Dynamic Arp Inspection

Identify your uplink ports and use the following command on them:

“ip arp inspection trust”

When ready to turn on DAI, run:

“ip arp inspection vlan (vlan to monitor)”

By default all ports are untrusted and should have 1 host to 1 network port.

If you have more than one host to a port with a dumb switch you need to use “ip arp inspection trust”, or else the switch will drop all of the hosts on that port.
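
To keep the order of operations straight, here is a small sketch that generates the commands from this post in two stages: snooping first, then DAI once the binding database has had time to populate. The function names, interface names, and addresses are made up for illustration. The commands themselves are the ones shown above.

```python
def snooping_commands(vlan, uplinks):
    """Stage 1: enable DHCP snooping and trust the uplink/DHCP server ports."""
    lines = ["ip dhcp snooping", "ip dhcp snooping vlan {0}".format(vlan)]
    for interface in uplinks:
        lines += ["interface {0}".format(interface), " ip dhcp snooping trust", "exit"]
    return lines


def dai_commands(vlan, uplinks, static_hosts=()):
    """Stage 2: trust uplinks for ARP, bind static hosts, then enable DAI last.

    static_hosts is a sequence of (mac, ip, interface) tuples.
    """
    lines = []
    for interface in uplinks:
        lines += ["interface {0}".format(interface), " ip arp inspection trust", "exit"]
    for mac, ip, interface in static_hosts:
        lines.append(
            "ip source binding {0} vlan {1} {2} interface {3}".format(mac, vlan, ip, interface)
        )
    # Enabling inspection is the final step so nothing is dropped before the trusts exist.
    lines.append("ip arp inspection vlan {0}".format(vlan))
    return lines
```

Printing the output of each function gives a block you can review and paste in during a change window. Keeping the two stages as separate functions is a reminder not to run the second one until “show ip dhcp snooping binding” looks complete.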