SoloManager: Difference between revisions

From Eigenvector Research Documentation Wiki

Revision as of 14:40, 14 June 2010

This page describes the SoloManager program and its usage

Purpose

The purpose of SoloManager is to start a target program locally and then continuously monitor the target program's availability. The target program responds to TCP/IP queries on a specified port if it is operating normally. If the target program becomes unresponsive for a specified period of time, SoloManager can terminate and restart it, and/or reboot the host computer entirely. Many aspects of the SoloManager program can be configured by specifying values in the SoloManager.ini text file.


Description of components

SoloManager .jar file

This contains the SoloManager program, and a sample SoloManager.ini file. It also contains all necessary Java library files.

SoloManager configuration file

Contains configuration details specifying how the SoloManager operates. See the example configuration file listed below.

Target program

This is the program we wish to monitor and keep available at all times. It must expose a TCP port and respond to socket queries on that port.

Wrapper service (optional)

This is an optional component which will start SoloManager whenever the host computer is booted up. It is described in the sections below about starting SoloManager as a Service or Daemon.

Relationships and processing sequence

These components are related as shown in the SoloManager flowchart.

Flowchart.png

Typical process flow

The SoloManager is typically started automatically when the host computer is booted up, usually via the Service and Daemon Wrapper.

Once started, the SoloManager begins by reading in values for all configurable parameters from the SoloManager.ini file. This file can be edited by the user to specify their preferred settings, but it must be located in the same directory as the SoloManager jar file. This is where the user specifies, for example, the name of the target executable which SoloManager will start and monitor.

SoloManager then begins its unending loop where it checks the status of the target program. SoloManager creates a socket connection to the target program and sends a query. If the target program is alive it sends a response which must match what SoloManager is expecting.

SoloManager checks that the target program is alive by:

  1. Opening a socket on the target program's port.
  2. Sending the parameter "msgToSocket" to the socket and verifying that the first line returned from the socket equals the parameter "expectedResponse".
If the response is not valid, SoloManager will repeat this check up to "fastFailCountLimit" times with a pause of "fastQueryIntervalSeconds" seconds.
If the response is valid, the check is complete with a result of success.
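
The target check described above can be sketched in Python as follows. This is only an illustrative sketch, not SoloManager's actual (Java) implementation: the function names query_once and target_is_alive are hypothetical, the newline appended to the message and the 5 second timeout are assumptions, and the pause is applied only between attempts, which the description above does not state explicitly.

```python
import socket
import time


def query_once(host, port, msg, timeout=5.0):
    """Send msg to host:port and return the first line of the reply.

    Returns None if the connection or the read fails.
    Assumption: the message is terminated with a newline.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall((msg + "\n").encode("utf-8"))
            with sock.makefile("r", encoding="utf-8", errors="replace") as reader:
                return reader.readline().rstrip("\r\n")
    except OSError:
        return None


def target_is_alive(query, expected_response, fast_fail_count_limit,
                    fast_query_interval_seconds, sleep=time.sleep):
    """Repeat the query up to fast_fail_count_limit times.

    query is a zero-argument callable returning the first reply line.
    Returns True as soon as a reply equals expected_response, else False.
    """
    for attempt in range(fast_fail_count_limit):
        if query() == expected_response:
            return True
        # Assumption: pause only between attempts, not after the last one.
        if attempt < fast_fail_count_limit - 1:
            sleep(fast_query_interval_seconds)
    return False
```

A caller would typically wrap query_once in a lambda carrying the values of serverIP, serverPort and msgToSocket from the configuration file, then pass that lambda as the query argument.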

If the target check was successful then the failure counter is reset to zero and the loop repeats after a specified pause period of "slowQueryIntervalSeconds" seconds. If the target check was not successful then the failure counter is incremented. The loop continues until this counter reaches the specified "nResponseRestart" counter value, whereupon SoloManager issues a command to restart the target program and continues with the loop. If the target program restarts then the next check will be successful, so the loop continues normally.

If the restart command does not succeed in restarting the target program then the target checks will continue failing and the failure counter incrementing until it eventually attains the specified "nResponseReboot" counter value. At this point SoloManager issues a command to reboot the host computer and the entire process begins again.
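
The escalation logic in the two paragraphs above can be illustrated with a small Python sketch. The function names are hypothetical, and two details are assumptions rather than documented behavior: a restart is issued only when the counter is exactly equal to nResponseRestart, and a zero or negative threshold suppresses the corresponding action (as noted in the example configuration file).

```python
def escalation_action(failure_count, n_response_restart, n_response_reboot):
    """Return 'reboot', 'restart', or None for the current failure counter."""
    # A non-positive threshold suppresses the corresponding action.
    if n_response_reboot > 0 and failure_count >= n_response_reboot:
        return "reboot"
    if n_response_restart > 0 and failure_count == n_response_restart:
        return "restart"
    return None


def simulate(check_results, n_response_restart, n_response_reboot):
    """Replay a sequence of target-check results (True = success).

    Returns the list of actions that would be issued, stopping at a reboot.
    """
    failure_count = 0
    actions = []
    for ok in check_results:
        if ok:
            failure_count = 0  # a successful check resets the counter
            continue
        failure_count += 1
        action = escalation_action(failure_count, n_response_restart,
                                   n_response_reboot)
        if action is not None:
            actions.append(action)
        if action == "reboot":
            break  # the host reboots and the whole process begins again
    return actions
```

For example, with nResponseRestart = 1 and nResponseReboot = 3, three consecutive failed checks yield a restart on the first failure and a reboot on the third.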

During these operations SoloManager writes status information to a log file and optionally can send e-mail to report events. The log file will be located in the directory specified by "outdir". Its size is limited to the last "logFileMsgCapacity" log messages. E-mailed alerts are optional and are enabled by setting "enableEmailing" = true. In this case e-mail messages will be sent to the specified user whenever:

  1. The SoloManager program starts.
  2. SoloManager is about to issue a restart command for the target program.
  3. SoloManager is about to issue a reboot command to the host computer's operating system.

Dependencies

SoloManager requires the following:

  1. Java version 1.5 or later is available on the host computer.
  2. It must be able to write to a log file on the filesystem.
  3. It must be able to issue a system reboot command (command can be defined within the configuration file).
  4. The operating system may be any of: Linux, Windows (2000, XP, 2003, 2008, Vista, 7), or Mac.


Configuration File

The configuration file will almost always need to be modified for the individual application and installation settings. An example file is included below, but a few key settings to modify include:

  • executableName Name of the program to run (usually either Solo.exe or Solo_Predictor.exe.)
  • startExecutableCommandPre Full path to the program listed as executableName (unless the program's folder has been added to the system path by the installer.)
  • outdir Specifies the folder which should contain the log files. By default these will be written to the same folder as the configuration file, but another folder may be preferable if the user does not have read/write permissions to that folder.
  • nResponseRestart and nResponseReboot indicate how many failures must occur before the application is restarted and/or the system is rebooted (respectively). If the Target Application fails after starting successfully, the failure will be detected by the next normal check, which occurs every slowQueryIntervalSeconds seconds. When a target check fails, a restart is invoked after (fastQueryIntervalSeconds+1)*fastFailCountLimit*nResponseRestart seconds. If the restart attempts fail, a system reboot is invoked after (fastQueryIntervalSeconds+1)*fastFailCountLimit*nResponseReboot seconds. Thus the worst-case total elapsed time from the target failing until an action occurs can be roughly calculated by:

  ResponseTime = slowQueryIntervalSeconds + (fastQueryIntervalSeconds+1)*fastFailCountLimit*nResponse____


The settings in the example configuration file represent likely minimum settings. If longer delays are acceptable before a response, increase the fastQueryIntervalSeconds and/or the nResponse____ settings.
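
As a worked example, the ResponseTime formula can be evaluated with the values from the example configuration file on this page (slowQueryIntervalSeconds = 6, fastQueryIntervalSeconds = 2, fastFailCountLimit = 2, nResponseRestart = 1, nResponseReboot = 3). The function name below is hypothetical, and the result is only the rough worst-case estimate the formula gives.

```python
def response_time_seconds(slow_query_interval_seconds,
                          fast_query_interval_seconds,
                          fast_fail_count_limit,
                          n_response):
    """Rough worst-case seconds from target failure until the action occurs."""
    return (slow_query_interval_seconds
            + (fast_query_interval_seconds + 1)
            * fast_fail_count_limit
            * n_response)


# Values taken from the example SoloManager.ini on this page.
restart_time = response_time_seconds(6, 2, 2, 1)  # 6 + 3*2*1 = 12 seconds
reboot_time = response_time_seconds(6, 2, 2, 3)   # 6 + 3*2*3 = 24 seconds
print(restart_time, reboot_time)
```

With these settings a restart is issued roughly 12 seconds after the target fails, and a reboot roughly 24 seconds after it fails if the restart does not help.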

--------------------------------------------------------------------------
------------ start: Example SoloManager.ini configuration file -----------
# default values for the SoloManager
#
# Period to pause when fast and slow polling the executable
fastQueryIntervalSeconds = 2
slowQueryIntervalSeconds = 6
#
# How many times to poll when getting fail result before escalating the response level
fastFailCountLimit = 2
# The initial fastFailCountLimit is usually larger, to allow time for target system startup
startFastFailCountLimit = 15
#
# How many fast cycles should occur with fails before applying response for level 1, 2, etc.
# Note: set to zero or a negative integer to suppress the response action from occurring
#nResponse1
nResponseRestart = 1
# nResponse2
nResponseReboot = 3
#
# maxTargetRunDurationHours. Non-positive value disables this feature.
# Positive value must be greater than 0.05 (hours)
maxTargetRunDurationHours = 0
#
# executable details
executableName = solo_predictor.exe
startExecutableCommandPre = c:\\Progra~1\\EVRI\\Solo_Predictor\\application\\app-bin\\win32\\
startExecutableCommandPost =
stopExecutableCommandPre = taskkill /F /IM \"
stopExecutableCommandPost = \"
#
# reboot
rebootCommandPre =
rebootCommandPost =
rebootCommand = shutdown /?
#
# executable socket details
serverIP = 127.0.0.1
serverPort = 2211
#
# log file capacity
logFileMsgCapacity = 6000
#
# Output directory. DO NOT add surrounding quotes
outdir = .
#
# must be true or false, case insensitive:
enableEmailing        = false
#
# mailserver

mailServer            = mail.eigenvector.com

mailServerPort        = 587
mailUsername          = USERNAME@eigenvector.com
mailPassword          = PASSWORD
# Note: mail Addresses cannot include spaces and must be well-formed addresses
mailRecepientAddress  = SOMEONE@gmail.com
# Use something which will be a valid e-mail address:
mailSenderAddress     = monitor@solopredictor.com
#
------------- end: Example SoloManager.ini configuration file ------------
--------------------------------------------------------------------------
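
To check a configuration file before deploying it, the "key = value" layout shown above can be read with a few lines of Python. This is a minimal sketch and not how SoloManager itself parses the file: parse_settings is a hypothetical helper, it treats lines starting with "#" as comments, skips lines without "=", and leaves backslash sequences such as "\\" untouched.

```python
def parse_settings(text):
    """Parse 'key = value' lines into a dict of strings.

    Blank lines, lines starting with '#', and lines without '=' are skipped.
    Values are kept as raw text; no escape processing is performed.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


example = """\
# Period to pause when fast and slow polling the executable
fastQueryIntervalSeconds = 2
slowQueryIntervalSeconds = 6
startExecutableCommandPost =
rebootCommand = shutdown /?
enableEmailing        = false
"""

settings = parse_settings(example)
# enableEmailing must be true or false, case insensitive.
emailing_enabled = settings["enableEmailing"].lower() == "true"
```

All values are returned as strings, so numeric settings such as fastQueryIntervalSeconds need an explicit int() conversion before use.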

Starting SoloManager Automatically

SoloManager is most useful when run automatically by an operating system. This will start the Target Application in the background. The following describes how to install SoloManager as a service (Windows) or daemon (Linux).

Running SoloManager as a Windows Service

The service folder in the SoloManager main folder contains the tools necessary to run SoloManager as a Windows service. This will automatically start the application without a user logging in. Follow these instructions to install SoloManager as a Windows service:

  1. Copy the application files onto the computer on which the application is to be run.
  2. Configure solomanager.ini as needed for the intended behavior.
  3. Copy solomanager.ini into the "service" folder. This copy of solomanager.ini will be used by the service.
  4. Run the Install_Service.bat file in the service folder to install the service (this batch file must be run by a user with administrative privileges).

Errors and status messages will be reported to the log files stored in the service/logs folder. To move logs to a different location, edit the service/conf/service.conf file. You can also modify the logging behavior in this file (maximum length, number of log backups, etc.)

On some systems, the service must be executed with the credentials of a specific user in order for the target application to start. If it is not, the service will not be able to start the Target Application and the log will reflect this problem. To fix this, edit the conf\service.conf file and locate the section which defines the wrapper.ntservice.account and wrapper.ntservice.password settings. The documentation in the file describes how to modify these settings.

To uninstall the service, run the Uninstall_Service.bat file (as an administrator.)

If you have problems, try running the test script:

 Test_Service

to see if the server will start when run manually.

Running SoloManager as a Unix/Linux Daemon

The daemon_linux folder in the SoloManager main folder contains the tools necessary to run SoloManager as a Linux daemon. This will automatically start the application without a user logging in. Follow these instructions to install SoloManager as a Linux Daemon:

  1. Copy the application files onto the computer on which the application is to be run.
  2. Configure solomanager.ini as needed for the intended behavior.
  3. Copy solomanager.ini into the daemon_linux folder. This copy of solomanager.ini will be used by the daemon.
  4. Run the Install_Daemon script to install the daemon (this script must be run by a user with root privileges).
./Install_Daemon

NOTE: In order to execute this script and have the daemon operate correctly, you may have to manually set the "execute" bit on all files in the top-level daemon_linux folder to "on" using the chmod command inside the daemon_linux folder:

chmod 755 *

Errors and status messages will be reported to the log files stored in the daemon_linux/logs folder. To move logs to a different location, edit the daemon_linux/conf/wrapper.conf file. You can also modify the logging behavior in this file (maximum length, number of log backups, etc.)

To uninstall the daemon, run the Uninstall_Daemon script (as root.)

./Uninstall_Daemon

If you have problems, try running the test script:

./Test_Daemon

to see if the server will start when run manually.