Tuesday, March 17, 2009

Basic to Advanced Check Point Gateway Troubleshooting

Don't you just hate it when that new business-critical application just won't get through the firewall? You pushed the rule, it should be allowed, but your application provider is reporting they are not live with the fancy new application that is going to change the way you do business.

Or those times when, all of a sudden, you get the trouble call: a perfectly good working application that was busy changing the way you do business is suddenly not working. What now?

As the firewall administrator, like it or not, you are usually in a unique place to initiate the troubleshooting process, and ultimately help your company get back to a productive state. Don't let this process turn into a mad race to flip switches and press buttons until someone calls and says it's alright. Troubleshooting your network is a bit of an art, but that is no excuse not to have a process in place to make this the most effective response to an unknown situation. Please remember the following when starting your troubleshooting process.

Troubleshooting is NOT about fixing a problem. It's about finding it.

It's usually very easy to see what went wrong, after the problem is found, and knowing how to fix it should be quickly evident. But for this session, let's not worry about fixing anything. Let's just figure out what needs the fixing.

Depending on how complex this problem gets, at some point you will be contacting support for assistance. It is important to make these first steps part of that process, even if you don't plan to escalate anything. At the first sign of trouble, it's time to do some digging before we call anyone.

Question for you: who is most equipped to solve your problem? You, or PhoneBoy?


For those of you that don't know the man, the myth... the legend: back when large-scale firewalls were still a new commodity, information for troubleshooting was scarce. PhoneBoy was a pioneer in community-based support, managing a knowledgebase that saved my ass more than once. A searchable, managed knowledgebase is somewhat of a commodity these days (although still an art to manage well), so his knowledgebase is largely superseded by the Check Point knowledgebase, but back in the day everyone looked to PhoneBoy for the answers.

So I ask again, who would you want troubleshooting your network, PhoneBoy or you? Of course the real answer is you. You can look up the things PhoneBoy knows, but no one knows your environment better than you... or do you?

So disaster strikes: a new application is not working, or worse, an existing application has stopped working. What is your first impulse? It's usually to find out what has changed and start flipping it back. Ignore this impulse for now, because only one of two things will happen and both of them are bad. What if you back out the last change, and the problem still exists? What then? Do you start frantically backing out more things? Do you reverse the backout and move forward? Then it gets messier: what if you back out a change and things start working again? I assume you made that change for a reason, and sooner or later, you will have to figure out how to put it back in. Will you know enough to implement the change without impact? Forget about backing things out. First we verify everything.

How you verify is the creative part. I, for one, like the OSI model. Don't get me wrong, it's not perfect, but I am just using it as a framework. Start from layer 1 and work all the way through, even if you think you have found a problem. Note it, and move on. Far too many times I have seen the horrible assumption that we are looking for A problem, when in fact multiple things need to be addressed. There is nothing worse than fixing a problem you have found, only to have the issue stay, or worse, simply manifest in a new and exciting way.

Let's confirm the physical layer. Machine plugged in? Turned on? They sound like simple questions to answer, but not if your infrastructure is 50 km away, tucked safely in a datacenter. Use your layer 2 information to verify.

[Expert@sevenof9]# ethtool eth1








I really hate auto-negotiation. I have had many a night ruined by auto-neg, so check your firewall interfaces closely, along with the switch or device each one is connected to.


[Expert@sevenof9]# arp -a

This could tell you a lot, if you know what it's supposed to look like. Here we go with the process part: let's establish that at some point (and with some regularity) you run something like this:








[Expert@sevenof9]# arp -a > arptable.out

That way you can quickly run something like this to focus on what has changed or gone missing on your layer 2 network.

[Expert@sevenof9]# arp -a > checkarp.out
[Expert@sevenof9]# diff checkarp.out arptable.out


By having a point of reference I can quickly see that I have lost layer 2 connectivity with the labrat server.
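
A raw diff can get noisy if the ARP table does not list its entries in the same order every time. As a minimal sketch (the helper name snapshot_diff is made up for illustration, and it assumes sort, comm and mktemp are available on the gateway), the comparison can be made order-independent and labeled:

```shell
#!/bin/sh
# snapshot_diff BASELINE CURRENT  (hypothetical helper, not a Check Point tool)
# Prints entries that disappeared since the baseline and entries that are new.
snapshot_diff() {
    baseline=$1
    current=$2
    [ -r "$baseline" ] || { echo "no baseline: $baseline" >&2; return 2; }
    [ -r "$current" ]  || { echo "no snapshot: $current" >&2; return 2; }

    old_sorted=$(mktemp) || return 2
    new_sorted=$(mktemp) || { rm -f "$old_sorted"; return 2; }

    # comm needs sorted input; sorting also hides harmless ordering changes.
    sort "$baseline" > "$old_sorted"
    sort "$current"  > "$new_sorted"

    # Lines only in the baseline are gone now; lines only in the
    # current snapshot have appeared since the baseline was taken.
    comm -23 "$old_sorted" "$new_sorted" | sed 's/^/MISSING: /'
    comm -13 "$old_sorted" "$new_sorted" | sed 's/^/NEW: /'

    rm -f "$old_sorted" "$new_sorted"
}

# Usage on the gateway:
#   arp -a > checkarp.out
#   snapshot_diff arptable.out checkarp.out
```

A MISSING line for a server you expect to see is the kind of layer 2 loss described above. An empty result means the table matches the baseline.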

Do the same for your routing as we move up to layer 3.






[Expert@sevenof9]# netstat -rn > route.out <--- hopefully you have run this when things are working!
[Expert@sevenof9]# netstat -rn > checkroute.out
[Expert@sevenof9]# diff checkroute.out route.out
[Expert@sevenof9]# <--- hopefully nothing shows up
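
Since the whole approach depends on having those "known good" files around, it may help to collect them in one place. This is only a sketch: save_baseline is a hypothetical helper and /var/tmp/baseline is an example directory, not a Check Point convention.

```shell
#!/bin/sh
# save_baseline DIR NAME COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND and stores its output, including errors, in DIR/NAME.out.
save_baseline() {
    dir=$1
    name=$2
    shift 2
    mkdir -p "$dir" || return 1
    "$@" > "$dir/$name.out" 2>&1
}

# Usage on the gateway, while everything is healthy
# (/var/tmp/baseline is an example path):
#   save_baseline /var/tmp/baseline arp   arp -a
#   save_baseline /var/tmp/baseline route netstat -rn
```

Running the same calls against a second directory during an incident gives you a matching set of files to diff.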

OK, we get the basics. Make sure you have test run the following commands, and perhaps created a nice spreadsheet of the commands and the outputs you expect. Let's list some important things you should be checking.

Verify the firewall is active:
[Expert@sevenof9]# fw tab -t connections -s

[Expert@sevenof9]# fw stat -l

Right about now things are probably starting to heat up for you, and you have verified connectivity into the upper layers using OS-level tracing with tcpdump. Many tools exist to analyze the capture, and the format is simple, so search for the source, destination, or service and locate the traffic.




You want to watch for the SYN and follow the TCP sequence from there. Any odd RST or failure to respond is more information we have to go on.
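
One way to keep a capture focused on exactly those packets is to filter on the TCP flags. The sketch below only builds the filter text. The function name, interface, address and port are placeholders, and it assumes a tcpdump build that understands the tcpflags, tcp-syn and tcp-rst keywords (they are part of the standard libpcap filter language).

```shell
#!/bin/sh
# syn_rst_filter HOST PORT  (hypothetical helper)
# Prints a tcpdump filter matching segments to or from HOST on PORT
# that have the SYN or RST flag set (so SYN-ACK replies match too).
syn_rst_filter() {
    host=$1
    port=$2
    printf 'host %s and tcp port %s and (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)\n' \
        "$host" "$port"
}

# Usage on the gateway (eth1, 192.0.2.10 and 443 are example values):
#   tcpdump -nn -i eth1 "$(syn_rst_filter 192.0.2.10 443)"
```

A SYN with no SYN-ACK coming back, or a RST right after the SYN, tells you where to look next.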


Time to make sure the system is not overloaded:

[Expert@sevenof9]# vmstat 5
You want to see VERY low numbers in the si/so columns (memory swapped in and out) and high numbers in the id column (idle CPU). This will tell you if the box is working too hard at the OS level, but it is not the last place we look. Firewall memory management is complex, but we can use a simple tool to understand how the firewall is doing, separate from the OS. 'fw ctl pstat' is the key, and without getting too deep into the complexities of the output, look for something very simple.




[Expert@sevenof9]# fw ctl pstat | grep fail
Allocations: 81983671 alloc, 0 failed alloc, 81827895 free
Allocations: 26148 alloc, 0 failed alloc, 25653 free, 0 failed free
Allocations: 82007088 alloc, 0 failed alloc, 81851120 free, 0 failed free
0 failed stack calls
0 large, 1 duplicates, 0 failures

You should not see any failures. If you do, determine whether they are HMEM, SMEM or KMEM failures. If it's HMEM, go to your capacity optimization settings and increase the default table size. If it's SMEM, make sure the box is not saturated. If you find KMEM failures, it's probably time to call support.
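
On a busy day it is easy to misread a long row of zeros. As a rough sketch, the helper below prints only the lines where a counter in front of a "fail..." word is not zero. The name pstat_failures is made up, and the parsing is a heuristic based on the sample output above, so treat it as an assumption about the format rather than a guarantee.

```shell
#!/bin/sh
# pstat_failures  (hypothetical helper)
# Reads "fw ctl pstat" output on stdin and prints every line in which
# a number greater than zero is directly followed by a word starting
# with "fail". Prints nothing when all failure counters are zero.
pstat_failures() {
    awk '
        {
            hit = 0
            for (i = 1; i < NF; i++) {
                if ($i ~ /^[0-9]+$/ && $(i + 1) ~ /^fail/ && $i + 0 > 0)
                    hit = 1
            }
            if (hit)
                print
        }
    '
}

# Usage on the gateway:
#   fw ctl pstat | pstat_failures
```

Empty output is the healthy case. Anything printed is a line worth reading in full context in the complete 'fw ctl pstat' output.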

As we continue our troubleshooting, sooner or later you will end up pulling out the 'fw monitor' tool. 'fw monitor' is not your ordinary sniffer: it lets you see what the firewall sees, as well as trace how the packet changes as it passes through the firewall's processing. If you are not familiar with this tool, do not use it in production without assistance from support.

Capturing a session could look something like this:

1) Define a large debug buffer: fw ctl debug -buf 32000
2) Turn on debug flags that help in understanding the context: fw ctl debug + vm conn
3) Turn on drop logging in a way that dumps dropped packets as well: fw ctl set int fw_droplog_options 0x11
4) Start collecting debug information from the kernel, with timestamps enabled: fw ctl kdebug -T -f
5) Run "fw monitor -o capt_file" to capture the crafted packets

You can check which debug flags are enabled by simply running:
fw ctl debug

Don't forget to turn debugging off:
fw ctl debug 0

This de-allocates the buffer and automatically kills the "fw ctl kdebug" process.
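
Because forgetting the last step is the expensive mistake, the sequence can be wrapped so that debugging is always switched off again. This is a sketch under assumptions, not a supported procedure: debug_capture and the FW variable are made up for illustration, the fixed time limit is my addition, and fw_droplog_options is not restored because the steps above do not say what its previous value was.

```shell
#!/bin/sh
# FW is the command that gets run. It defaults to the real fw binary;
# setting FW=echo gives a dry run that only prints each step.
FW=${FW:-fw}

# debug_capture OUTFILE SECONDS  (hypothetical wrapper around the steps above)
debug_capture() {
    outfile=$1
    seconds=$2

    # Ctrl-C should end the wait early, not skip the cleanup below.
    trap : INT

    $FW ctl debug -buf 32000
    $FW ctl debug + vm conn
    $FW ctl set int fw_droplog_options 0x11
    $FW ctl kdebug -T -f > "$outfile" &
    kdebug_pid=$!

    sleep "$seconds"

    # Always switch debugging off, then make sure the reader is gone.
    $FW ctl debug 0
    kill "$kdebug_pid" 2>/dev/null || true
    wait "$kdebug_pid" 2>/dev/null || true
    trap - INT
}

# Usage on the gateway (file name and duration are example values):
#   debug_capture /var/tmp/kdebug.out 60
```

Trying it first with FW=echo shows exactly which commands would run, without touching the kernel debug settings.
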
Most of the time you will be doing this under the direction of support. I can't say that enough: please don't nuke your firewall with a debug and write me to complain. But here is a quick shortcut to try. It may help you gather the information you need to identify where the problem might be, and potentially show that the firewall is acting as expected.

[Expert@sevenof9]# fw ctl zdebug drop
It's a great little tool for finding out the reason for those drops you really don't want, but BE CAUTIOUS. On an already overloaded system (even if you can't see it at the OS level) it could cause instability. As much as you want to capture all the information you can, using filters in 'fw monitor' will help in sampling traffic without overdoing the load on the system. Either way, do consider the overall health of the system and the risk of downtime when planning a debugging session.
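
As a sketch of what such a filter can look like, the helper below builds an expression that limits 'fw monitor' to a single host. The function name and the address are placeholders. The src/dst form is the commonly documented filter syntax, but check it against your version before relying on it.

```shell
#!/bin/sh
# monitor_filter HOST  (hypothetical helper)
# Prints an fw monitor filter expression that accepts only packets
# sent from or to HOST.
monitor_filter() {
    printf 'accept src=%s or dst=%s;\n' "$1" "$1"
}

# Usage on the gateway (192.0.2.10 is an example address):
#   fw monitor -e "$(monitor_filter 192.0.2.10)" -o capt_file
```

Narrowing the capture to the one host you care about keeps both the load and the output file small.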

Firewall system logs also provide great information in the log directory, if enabled to do so.

[Expert@sevenof9]# cd $FWDIR/log
[Expert@sevenof9]# pwd
/opt/CPsuite-R65/fw1/log
[Expert@sevenof9]# ls *.elg
aciufpd.elg ahttpd.elg avi_del_tmp_files.elg epq.elg igwd.elg stormd.elg
aclientd.elg asessiond.elg cphttpd.elg funcchain.elg mdq.elg su.elg
ahclientd.elg aufpd.elg dtps.elg fwd.elg rtmd.elg vpnd.elg

All of these *.elg files represent a great place to find information about the issue you are searching for, however, they do not, by default, log much information except system startup times. To get more detail you will have to let the firewall know, and this is not something you should be doing all the time.

For example, if you are having a problem getting a VPN tunnel to come up, turning on debugging for the vpn process will provide a wealth of detail, for you and support.

[Expert@sevenof9]# echo "" > vpnd.elg <--- clears out old entries so they don't get in the way
[Expert@sevenof9]# vpn debug on <--- turns debugging on, entries are written to the log
[Expert@sevenof9]# <--- reproduce the problem here
[Expert@sevenof9]# vpn debug off <--- make sure you turn it off when you are done

There is also information that can be turned on for the fwd (gateway) and fwm (management).
  • fw debug fwd on TDERROR_ALL_ALL=5
TDERROR_ALL_ALL is a value from 0-5, 5 being the most information. Adjust for your situation.
Logs are redirected to $FWDIR/log/fwd.elg

This level of debug is still fairly high level, compared to how deep we are prepared to go. To get into depth, commands like the following will provide the detail support might need.

VPN debug
  • fw ctl debug -buf 10000
  • fw ctl debug -m vpn all
  • fw ctl kdebug -f > VPN_debug &
  • vpn debug ikeon/ikeoff
Logs are redirected to $FWDIR/log/ike.elg
  • vpn debug on/off
Logs are redirected to $FWDIR/log/vpnd.elg

Also check sk32788 for troubleshooting 3rd party VPN connectivity.

NAT debug
  • fw ctl debug -buf 10000
  • fw ctl debug xlate xltrc
  • fw ctl kdebug -f > NAT_debug &

SmartDefense Active debug
  • fw ctl debug -buf 10000
  • fw ctl debug -m fw + conn drop vm
  • fw ctl debug -m CPAS all
  • fw ctl kdebug -f > CPAS_debug &

And for SmartDefense Passive inspection
  • fw ctl debug -buf 10000
  • fw ctl debug -m fw + conn drop vm tcp-str spii
  • fw ctl kdebug -f > SD_debug &

Doing VoIP debug will depend on the type of VoIP traffic you are protecting.

SIP
  • fw ctl debug -buf 16000
  • fw ctl debug + sip
  • fw ctl kdebug -f > file.dbg
MGCP
  • fw ctl debug -buf 16000
  • fw ctl debug + mgcp
  • fw ctl kdebug -f > file.dbg
Skinny
  • fw ctl debug -buf 16000
  • fw ctl debug -m CPAS skinny
  • fw ctl kdebug -f > file.dbg
MSNMS
  • fw ctl debug -buf 16000
  • fw ctl debug + msnms sip
  • fw ctl kdebug -f > file.dbg
H.323
  • fw ctl debug 0
  • fw ctl debug -buf 16000
  • fw ctl debug -m h323 all
  • fw ctl kdebug -f > file.dbg

When planning your debugging session, take special care when dealing with a cluster of firewalls. You may be running the debug on a member that is not handling the traffic, in which case you are wasting time looking across two, three or four times as many systems. If you can reduce the cluster to a single member, this will simplify the process, and if this clears the issue up, you now know you need to continue with the next section, debugging clusters. Otherwise, if the cluster must remain active, you will need to perform all the previous debug steps on ALL members, including the cluster-specific debug.

Cluster debug
  • fw ctl debug -buf 10000
  • fw ctl debug -m fw + sync
  • fw ctl debug -m cluster all
  • fw ctl kdebug -f > CLUSTER_debug &

Last, but never least, don't forget about SecureXL templates. SecureXL operates at the driver level and can hide traffic from the OS when it is accelerating traffic in the firewall kernel.

Check the status with 'fwaccel stat' and disable acceleration with 'fwaccel off' to ensure it is not part of the problem. Note any changes for support and re-enable with 'fwaccel on'. Beyond that, you may need to debug the process itself (hopefully under the direction of support) with 'fwaccel dbg'.

This is by no means a foolproof method of tracing all problems, so don't forget to keep looking around the firewall, into the network and right into the applications that are having so much trouble.

I also don't mean for this to be the complete troubleshooting guide, but it will hopefully get you started. Familiarizing yourself with the Advanced Technical Reference Guide (ATRG-NGX.pdf) in sk31221 and rigorously searching SecureKnowledge will go a long way toward deciphering the troubleshooting information you collect.

Happy hunting!

1 comment:

Faysal said...

Thank you for sharing your knowledge, after your presentation in LV. Much appreciated.