Best Practice: Discovery Troubleshooting

The number one best practice in troubleshooting discovery is having a firm
understanding of the four main stages of discovery itself: Port Scanning,
Classification, Identification and Exploration. Each stage follows the previous
one because of a positive result. Just like dominoes falling, if the line stops
at a certain point, it is because something did not happen in the step before.

Port scanning

Port scanning is the very first step in the discovery process. In this stage we
scan the defined IPs on TCP and UDP ports looking for specific responses,
namely that the port was open or responsive. The six main port checks we
perform by default are:

TCP
135 (epmap): looking for the potential of WMI or Windows systems
22 (SSH): looking for potential Unix systems
80/443 (HTTP/HTTPS): looking for potential web servers

UDP
161 (SNMP): we send a single OID query (sysDescr) looking for a response from
potential network devices
53 (DNS): we query the locally configured DNS server to resolve the name of
each IP
137 (NetBIOS): we query the local domain to also resolve the name of the IP

In scanning the IP ranges we look primarily for responses from WMI, SSH and
SNMP. When we see that an IP has a state of OPEN for any of these three ports
we move on to the next phase of discovery, classification. If we get no
response we do not list the IP, and if we get some response other than open we
record that state instead. Results can be found in the input of the Shazzam
probe of your scan.
If, for example, the Shazzam result shows port 135 in a state of open, the
target is most likely a Windows system, but we will find that out for sure when
our classifier probe fires next.
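
The hand-off from port scanning to classification can be pictured with a small
Python sketch. It is only an illustration of the rule described here, not how
the product implements it; the names CLASSIFICATION_PORTS and next_classifiers
and the shape of the port_states dictionary are assumptions made for the
example.

```python
# Hypothetical illustration: which classification probes would follow a scan.
# The port-to-protocol mapping comes from the port list in this section.
CLASSIFICATION_PORTS = {
    135: "WMI",   # TCP, potential Windows systems
    22: "SSH",    # TCP, potential Unix systems
    161: "SNMP",  # UDP, potential network devices
}


def next_classifiers(port_states):
    """Return the protocols whose port reported a state of 'open'.

    port_states maps a port number to the state string seen in the scan
    result (an assumed format, for example {135: "open", 22: "closed"}).
    """
    return [
        protocol
        for port, protocol in CLASSIFICATION_PORTS.items()
        if port_states.get(port, "").lower() == "open"
    ]
```

With this sketch, next_classifiers({135: "open", 22: "closed", 80: "open"})
yields only WMI: port 80 being open does not by itself trigger classification.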

If you are not getting the expected returns, or any returns at all, from IPs
that you have been told are a particular type of system, your best practice is
to walk up the OSI model. This applies across all stages of discovery.
Remember: all troubleshooting should be done from the MID Server host for a
direct comparison to how the MID Server is gathering the information.

Tests you can run to validate network connectivity and confirm your port
scanning results:

1. Use PING to see if you can see the host on the network (ping <host>)
2. If there is no ping response, use TRACEROUTE to see where traffic might be
stopped (traceroute <host>)
3. Use TELNET to see if you can connect to any of the TCP ports (telnet
<host> <port>)
4. Use an SNMP scanning tool to see if a potential network device is
responsive
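
If telnet is not available on the MID Server host, the TCP test in step 3 can
be approximated with a few lines of Python. This is a minimal sketch using only
the standard library; the function names tcp_port_open and
check_default_tcp_ports are made up for the example, and a successful connect
only shows that the port accepts connections, not that credentials will work.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_default_tcp_ports(host, ports=(135, 22, 80, 443)):
    """Check the default TCP ports listed earlier and return {port: bool}."""
    return {port: tcp_port_open(host, port) for port in ports}
```

Run it from the MID Server host, for example
check_default_tcp_ports("<host>"), and compare the result with what the
Shazzam probe reported for the same IP.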

The most likely issues around network connectivity are:

1. Routing: perhaps the MID Server host does not have network access to
the IP ranges you are looking to discover
2. Firewalls
a. Physical: firewalls that protect a large environment such as the
Data Center
b. Logical: software firewalls that protect an individual computer
3. Access Control Lists (ACLs, SNMP): IP-based lists on network devices
that determine which sources are allowed to communicate with the device

Resolution of these issues is best worked through with your network teams so
that you better understand your topology. The outcome may be the deployment of
additional MID Servers to help keep your network secure, and/or configuration
changes to networking components to allow access from your existing MID Server
host.

Classification

Once we see an open classification port (WMI/SSH/SNMP) we trigger a
classification probe to the specific IP for the specific protocol. This is also
the first time we use credentials, for this is our first application query to
the target. We classify computers based on their Operating System (OS) and
network devices based on their capabilities.

When a classifier probe is successful, its return contains the OS caption,
which we match against specific criteria. A match moves us on to the third
phase of discovery, identification. When troubleshooting classification
errors, the likely causes depend on the type of target.

Windows

If a Windows classifier probe is not returning as expected, the most likely
causes are:

1. Credentials
2. Logical Firewall
3. WMI Application performance/availability

Credentials (authentication failures)

To validate your credentials on Windows systems, log in to your MID Server
host (if possible) as the same user that Discovery is attempting to use. The
majority of issues are credential related.

Logical Firewall (Could not connect to WMI Service)

In WMI communication it is important to remember that traffic is only
initiated on port 135. When two Windows operating systems talk WMI they
negotiate unused high ports to finish the conversation. This presents a
problem with logical firewalls on Windows systems, for normally they will allow
port 135 to be seen, as it is by our port scan probe, but block the high ports
that we need to communicate. To overcome this you can configure the firewalls
to allow any port and any protocol from your MID Server host, use the WMI
script that can be run locally, or use other options such as locking down WMI
to specific ports.

It is strongly recommended that any system configuration changes be evaluated
and determined by your local security policies and processes.

WMI Application

It is not unheard of for a particular Windows application to stop performing
as expected. When all else has been validated, restart the OS (if possible) to
eliminate any residual potential issues.

Additional troubleshooting resource for Windows systems:
http://community.service-now.com/forum/3326

UNIX Systems

When a Unix classify probe is having problems, your most likely cause is:

1. Credentials

That is the great thing about Unix systems: the most likely issue is just
that, credentials. There are rare cases where your Unix systems may have local
ACLs similar to network devices, or even logical firewalls, but these are few
and far between.

To help diagnose these issues, use a third-party application such as PuTTY (or
your preferred SSH client) from the MID Server host to confirm the credentials
Discovery is attempting to use. You can also enable additional logging to
assist with your troubleshooting: http://community.service-now.com/forum/3327

Network Devices

When diagnosing potential network device issues the most likely causes are:

1. Credentials
2. Access Control Lists (ACLs)

Credentials are straightforward: be sure that the MID Server has the
appropriate read-only community string in the credentials table (we try public
by default).

Routers, powering devices, switches and even some printers, however, can use
Access Control Lists that limit which IPs are allowed to make queries to the
device. You want to be sure your MID Server's IP is in that list.

Since you cannot telnet to a UDP port, you would want to use a third-party
query tool such as iReasoning (or your tool of choice) from your MID Server
host to validate the proper credentials and ACL access.

Overall Classification

If you were able to gain access to your system but the device is still not
classifying, then you should ensure that you have an actual classifier for
that device and that the device meets its specific criteria. This is most
common with network gear that may not be covered out-of-box in the ServiceNow
product.

Additional debugging information can be enabled through the sys_properties
table to assist in any troubleshooting. By adding a
glide.discovery.debug.classification property of type true/false with the
value of true, we will add additional information to the system log to help
with classification issues.

Identification

Once a device has met the classification criteria, the next phase of discovery
is identification. Identification is the process where, before we create or
update anything in your CMDB, we make sure we are acting on the correct
record. One of three things can happen in the identification phase, where we
look to find a complete match based on these rules:

1. We find no complete match, in which case we create a CI
2. We find a single complete match, in which case we update that CI
3. We find multiple complete matches, in which case we stop discovery for
that IP and log the event
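
The three outcomes can be summed up in a few lines of Python. This is only a
sketch of the decision described above; the function name
identification_action and the returned labels are invented for the example and
do not correspond to anything in the product.

```python
def identification_action(matching_cis):
    """Map the number of complete CI matches to the identification outcome."""
    count = len(matching_cis)
    if count == 0:
        return "create"        # no complete match: a new CI is created
    if count == 1:
        return "update"        # single complete match: that CI is updated
    return "stop_and_log"      # multiple matches: stop for this IP and log it
```

Only the third outcome needs attention when troubleshooting, since it means
the CMDB already holds more than one record that fully matches the device.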

There is not much to troubleshoot here besides outcome #3, multiple matches.
This is where you would go into the CMDB, identify what is creating the
multiple matching records and, of course, stop that from happening. The
primary culprit is usually imports from external data sources whose coalescing
rules may not be as tuned as they should be. In the Discovery properties
section you can enable additional logging for the CI identification process,
which is written to the discovery log for each IP.

Exploration

Once you are at the exploration phase you have full access to your systems and
are now populating data. There is not much troubleshooting to do in this phase
other than identifying one-off probes and sensors that have issues accessing
specific areas, such as directories, file systems and unique commands that may
not execute due to permission issues. These can all be found in the discovery
log.

Processing

All work is processed in and out of the ECC Queue and can be identified on each
discovery status record.

In the ECC queue you can identify the four phases of discovery that are
executed for each device: Port Scanning (Shazzam), Classification,
Identification and Exploration. The ECC queue has two Queue column values and
four State column values.

Queue
1. Output
2. Input

State
1. Ready
2. Processing
3. Processed
4. Error

Errors for input and output records are rare. Output error records should
never be seen; if they are, there have been major changes to the moving pieces
and parts of your discovery application. Input error records are not uncommon,
especially when creating your own probes and sensors. This usually means the
system could not parse the XML return with the sensor script. In any error
response the details section will provide information on where the issue might
be.

An overall discovery process follows the same path each time for each probe:

1. An output ready record (probe) is created in the ECC queue, tasked to the
appropriate MID Server (agent)
2. Once the MID Server has collected the request it changes the record to
output processing
3. When the MID Server completes the task it creates an input ready record
4. The output processing record is then updated to output processed
5. The input ready record is then collected by the system and is updated to
reflect an input processing state
6. When the data is finished being entered into the CMDB the system updates
the record to an input processed state
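
One way to use this sequence when a probe appears stuck is to ask which party
has to act next for the queue and state you are looking at. The Python sketch
below encodes that reasoning as a lookup. The WAITING_ON table and the
waiting_on function are hypothetical helpers written for this guide, and the
wording of each entry is an interpretation of the steps above rather than
product output.

```python
# (queue, state) -> who has to act next, following the steps above.
WAITING_ON = {
    ("output", "ready"): "MID Server has not collected the probe yet",
    ("output", "processing"): "MID Server is still running the probe",
    ("output", "processed"): "MID Server is done; look for the input record",
    ("input", "ready"): "instance has not collected the result yet",
    ("input", "processing"): "instance is still processing the result",
    ("input", "processed"): "nothing; the probe is complete",
}


def waiting_on(queue, state):
    """Describe what an ECC queue record in this queue and state is waiting on."""
    queue, state = queue.lower(), state.lower()
    if state == "error":
        return "error; check the details section of the record"
    return WAITING_ON.get((queue, state), "unknown queue/state combination")
```

For example, waiting_on("output", "ready") points at the MID Server, while
waiting_on("input", "ready") points at the instance and its scheduler.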
Identifying where the processing is taking place is the best way of
identifying any bottlenecks and processing issues. Let's look at some common
scenarios.

Schedule sits in a starting state
Is the System Scheduler backed up with other system jobs? Is it running?

Schedule is active but the Shazzam probe is in an output ready state for longer
than 15 seconds
Check the state of the MID Server; it is most likely down or requires a
restart.

Schedule has been running and is active but there is no movement of output
probes; it seems stalled
Ensure you are giving the appropriate time for the probes to process.
Check the state of the MID Server; it could be down or require a restart.

Schedule has been running and is active but there is no movement on the input
probes, specifically the input ready entries
Confirm the scheduler is running and is not backed up.

You can see scheduler stats by going to System Diagnostics > Stats

Confirming this input processing information is valuable to give you peace of
mind, but any work that needs to be done to resolve an issue there will have to
be completed by Customer Support.
