pointer

Category Archives: VoIP Troubleshooting

Are Your Users Complaining? Pinpoint the Reason with OmniPeek

Try as you might, making users happy with network performance is a difficult task. Sometimes problems arise with applications and sometimes with the network itself. Either way, it is your job to make the end user experience the best that it can be.

Luckily, if you are using OmniPeek, problems of any type, whether application or network, or common or rare, can be easily remedied. Let’s take a look at some common user experience issues and how OmniPeek can help to identify the root cause and guide you towards a permanent solution.

Network Downtime
Fortunately unplanned network downtime is a pretty rare occurrence nowadays, but if it does occur the response must be instantaneous. With OmniPeek you can immediately assess the scope of the outage, from a few specific users to an entire subnet. If distributed, 24×7 analysis is in place, you can rewind the network to see exactly what was going on when the outage occurred, providing the best clues possible for determining the root cause, which in this case is probably equipment failure somewhere in the network path of the effected users.

Not Enough Bandwidth? Find the Hog
Bandwidth issues typically arise not because of a lack of bandwidth, but because a user or users are consuming an abnormally high amount of bandwidth. This is becoming more common with the wide availability of video streaming sources, particularly those that are not work related. The Compass dashboard in OmniPeek is an easy way to isolate bandwidth hogs, allowing you to identify not only the user but the type of traffic, providing the ammunition you need to ensure that network traffic is strictly business related.

VoIP Quality Issues
VoIP is the most commonly used real-time protocol on corporate networks. Given its real-time nature, it is extremely sensitive to network problems like too much latency, dropped packets, and jitter, much more so than “regular” network traffic. For real-time data to be useful, it must arrive in order, and within a few hundred milliseconds of being sent, otherwise it is no longer “real-time” and doesn’t fit into the overall conversational flow. Problems with real-time data will continue to grow as more corporate video is transmitted over IP networks, and as more wireless networks are used for the last 100m of data delivery (voice over Wi-Fi, or VoFi).

From a network perspective, VoIP, VoFi, or video over IP are just data on the network. In order to identify problems with real-time traffic, you first need to isolate the traffic, while still seeing it in the context of the overall network. To do this, you can use the Voice and Video dashboard in OmniPeek to see how real-time traffic is coexisting with the rest of the network. Then, the Calls and Media views will allow you to see a more detailed analysis of the packet-by-packet performance of the real-time flow, including detailed analytical metrics and a bounce diagram so you can pinpoint exactly where the problem is, and compare it with network activity to correlate real-time transport problems with overall network usage.

If you are experiencing another kind of reoccurring issue on your network, please leave us a comment and we’ll address best practices for remedying this issue.

Q&A: How Can I Ensure the Best Application Performance?

The network is often blamed for poor application performance, even if the network is not the culprit. Network engineers therefore need to know how to determine the cause of the problem, even if it’s in the application itself. Below, you’ll find the most common questions we get from our clients on this topic, and how you should address them – whether you’re working with wireless or in a cloud environment.

Q: How do you know if it’s your network or your application?
A: This is the first question you should figure out when users are complaining about a poor application response time. We’ve gone into great detail on this subject, but here are some initial indicators to help you prove that the network is not the culprit:

Packet-level monitoring will show you the conversation between a client and a poorly performing application. If a user request is followed by a quick network acknowledgement (ACK) but a delayed data response, it means it is an application issue. If the ACK is delayed or missing, then the network is to blame. You can get this information clearly visualized in OmniPeek’s Compass dashboard, using the “2-Way Latency” setting, which displays the network and application latencies, with drill-down on a per-node or even per-flow basis.

If you’re streaming Expert events into a log management system, network issues are shown through slow acknowledgements, TCP slow segment recovery, slow and frequent re-transmission, and low throughput. Application problems manifest in slow application response times.

Q: Does application monitoring change in a virtual environment?
A: While network monitoring may be more difficult in a virtual environment due to the introduction of overlay networks and virtual switches which often aren’t controlled by the network team, the fundamental analysis techniques are still valid. A capture only has to be in the packet path between the client and the server in order to get diagnostic info and answer the basic question: is the problem in the network or in the application?

Q: How should I address application problems if they are housed in a cloud environment?
A: Cloud is generally hostile to packet capture, since there’s no network visibility if you don’t control the network. In this environment, we recommend focusing on the end user experience. If there are complaints or concerns about the application performance, capture on a client machine to see what the traffic pattern reveals. We’ve found that many of our customers appreciate using the OmniPeek Remote Assistant capture agent, as it’s a lightweight capture tool with a simple user interface to capture packets from a Windows client. The encrypted capture file can then be sent back to the network team for analysis.

Q: How should I handle issues with real-time applications like VoIP?
A: Sadly, with VoIP it often is a network problem. Latency is often the root cause. Sometimes it’s a transitory problem, like routing reconvergence after a link goes down (or up). Sometimes, the packets are simply routed through inappropriate equipment, like a proxy which doesn’t do any VoIP analysis, but which still adds latency.

There is a pair of tools we recommend for VoIP. First, use the built-in VoIP analysis in OmniPeek Enterprise to measure the MOS scores and determine how widespread the problem is. Second, use Multi-Segment Analysis (MSA) to capture at multiple points simultaneously in the packet path, to determine whether there are any significant sources of latency in the network, and where they are.

Q: How do you know if application performance is sufficient?
A: This is subjective to the end user. We usually suggest examining the application response time. This measures the time it takes an application to respond to a specific user request on a per-request or per-flow basis. The Expert dashboards in OmniPeek and OmniEngine Enterprise will give you these numbers very easily.

Our products also assign one of three basic levels of performance: satisfied, tolerating and frustrated. We dive into deeper detail about how you should measure and report this here.

Q: What if my application has a bug in it? How do I know and how should I solve the problem?
A: Once you’ve demonstrated that there is a problem in the application, the next steps may not be obvious to the sysadmins or developers. Most modern applications are highly modularized, split into multiple layers across many different servers, and if a back-end service is slow to respond, that delay will propagate all the way to the user. Packet capture can provide insight here as well: use the application performance analysis techniques on a capture taken on the front-end server to see whether there’s a dependency on other servers, and what the application response looks like from those remote systems. This works even if the connections are SSL or TLS encrypted, as it will be clear which packets are simple ACKs and which are application-layer responses. Repeat until you find which server in the distributed application is causing the major slowdown.

As always, please let us know if you have any additional questions!

Q&A: How IT Trends Are Affecting the Network

We get a lot of questions from our customers about how they should prepare themselves for different technologies (cloud, 10G, etc.). For this blog we wanted to answer some of the common questions that we receive from our customers – mainly network engineers/administrators – regarding both specific networking trends as well as larger technology trends.

We have Jay Botelho, Director of Product Marketing at WildPackets, answering these questions. He’s been in the networking business for over 25 years. If you have any additional questions for our networking guru please be sure to let us know and we’ll address them in our next Q&A blog.

Q: What are some best practices for troubleshooting in a virtual environment?
A: Virtualization creates “blind spots” in your network, making it difficult to monitor with traditional techniques, i.e. spanning a switch port from a physical Ethernet switch to collect packet-based data for complete root-cause analysis.

Here’s a common scenario. A user is experiencing abnormally long delays while working with a specific application. You know from your network architecture that the app is running on a VM, but you’re not the application engineer so you’re not familiar with all of the nuances of the application’s operations (what data sources it accesses, under what conditions, etc.). You’re able to start a packet capture session on a switch just upstream from the virtual server running the app, and after filtering and watching the user connection to the app you can confirm long delays for some operations, but it is clear the delay is NOT between the user and your capture point. The delay is within the VM, in your blind spot, where communication between the application and the database are virtualized on the same VM.

To address this issue, you must collect data from  the virtual switch(es) within the VM in order to get visibility between the application and the database. There are several techniques to achieve this. First, you can use OmniVirtual, a packet-capture software probe specifically designed to run on VMs. You will need to allocate space on the VM to run OmniVitual, just as you would any other application. Once running, OmniVirtual will have access to all data crossing any of the virtual switches on the VM.

A second alternative is to use a virtual tap, available from several tap vendors. These virtual taps install at the VM layer, acting like a traditional tap, providing access to data crossing virtual switches on the VM to a host of solutions, including network analysis and troubleshooting solutions like the Omni Distributed Analysis Platform.

Q: How will migrating to a public Cloud affect my job?
A: There’s some very good data available from industry analysts that predict you will be busier as your company migrates to the Cloud, so don’t worry about your job! But your role will change from managing not only your own infrastructure, but overall service availability and performance in the Cloud as well. Cloud computing merely shifts your application servers from your facility to a third party. Issues like bottlenecks, bandwidth hogs, and unauthorized protocol usage will still adversely affect application traffic. Thus, more diligence must be applied in monitoring application performance and making sure that your service provider is living up to its promises.

Q: Why do issues with VoIP continue to persist?
A: This is a question that we consistently get from both potential customers and customers looking to deploy VoIP. The core issue is that networks are really not that friendly to real-time data (RTP, or real-time protocol, which is used by VoIP and video). Most networks are optimized to carry TCP/IP data traffic, which is much more tolerant of latency, packet loss, and jitter than is RTP. To compensate for this, networks require additional configuration for VoIP, including the use of Quality of Service tagging – QoS (at a minimum) – or even dedicated VLAN or MPLS segments to segregate and give priority to RTP traffic. If you either have or are planning a transition to VoIP, be sure you are using a network analysis solution that treats VoIP like any other data type on the network, since that’s exactly what it is. Often times VoIP problems spike during times of heavy network usage, so you need a solution that can see everything at once and allow you to correlate the activity of all traffic on your network simultaneously.

Q: How will my network analysis needs change as we roll out 10G?
A: 10G is a game-changer for network analysis and troubleshooting. The still oft-used “break/fix” method of network troubleshooting – where packet-based network analysis is only performed after a problem is reported – is no longer effective at 10G. With more data consolidated through fewer resources, the number of problems per segment increases, and the increased network speeds make it far more challenging to try to reproduce problems, or wait for them to happen again. At 10G you need to monitor and record packet-level data on an ongoing basis, arming yourself with a recording of all activity on these highly-utilized segments. If real-time monitoring indicates a negative trend, or if problem reports are rolling in, you can simply “rewind” the network to the troublesome period of time and analyze exactly what was going on. No waiting for it to happen again, and no need for Herculean efforts to reproduce the problem. You have all the data you need to solve the problem – immediately.

This wraps up our first Q&A session. Please keep your questions coming. We’re always up for a challenge, and let’s face it, we picked the softballs this time around…