SoloManager: Difference between revisions
imported>Donal |
Revision as of 15:17, 14 June 2010
This page describes the SoloManager program and its usage
Purpose
The purpose of SoloManager is to start a target program locally and then continuously monitor its availability. When operating normally, the target program responds to TCP/IP queries on a specified port. If the target program becomes unresponsive for a specified period of time, SoloManager can terminate and restart it, and/or reboot the host computer entirely. Many aspects of SoloManager can be configured by specifying values in the SoloManager.ini text file.
Description of components
SoloManager .jar file
This contains the SoloManager program, and a sample SoloManager.ini file. It also contains all necessary Java library files.
SoloManager configuration file
Contains configuration details specifying how the SoloManager operates. See the example configuration file listed below.
Target program
This is the program we wish to monitor and keep available at all times. It must expose a TCP port and respond to socket queries on that port.
Wrapper service (optional)
This is an optional component which starts SoloManager whenever the host computer is booted up. It is described in the sections below about starting SoloManager as a Service or Daemon.
Relationships and processing sequence
These components are related as shown in the SoloManager flowchart.
Typical process flow
The SoloManager is typically started automatically when the host computer is booted up, usually via the Service and Daemon Wrapper.
Once started, SoloManager begins by reading in values for all configurable parameters from the SoloManager.ini file. The user can edit this file to specify their preferred settings, but it must be located in the same directory as the SoloManager jar file. This is where the user specifies, for example, the name of the target executable which SoloManager will start and monitor.
SoloManager then begins its unending loop where it checks the status of the target program. SoloManager creates a socket connection to the target program and sends a query. If the target program is alive it sends a response which must match what SoloManager is expecting.
SoloManager checks that the target program is alive by:
- Opening a socket on the target program's port.
- Sending the parameter "msgToSocket" to the socket and verifying that the first line returned from the socket equals the parameter "expectedResponse".
- If the response is valid, the check is complete with the result success.
- If the response is not valid, SoloManager repeats this check up to "fastFailCountLimit" times, with a pause of "fastQueryIntervalSeconds" seconds between attempts.
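The check described above can be sketched in Python as follows. This is an illustration only, not SoloManager's actual Java code: the function name check_target, the newline appended to the message, the UTF-8 encoding, and the two-second socket timeout are all assumptions made for the example.

```python
import socket
import time


def check_target(host, port, msg_to_socket, expected_response,
                 fast_fail_count_limit, fast_query_interval_seconds,
                 timeout_seconds=2.0):
    """Return True as soon as one query yields the expected first line.

    Hypothetical sketch: up to fast_fail_count_limit attempts are made,
    pausing fast_query_interval_seconds between failed attempts.
    """
    for attempt in range(fast_fail_count_limit):
        try:
            with socket.create_connection((host, port),
                                          timeout=timeout_seconds) as sock:
                # Assumption: the query is sent as one newline-terminated line.
                sock.sendall((msg_to_socket + "\n").encode("utf-8"))
                with sock.makefile("r", encoding="utf-8",
                                   errors="replace") as reader:
                    first_line = reader.readline().rstrip("\r\n")
            if first_line == expected_response:
                return True
        except OSError:
            # Refused connections and timeouts count as a failed attempt.
            pass
        if attempt < fast_fail_count_limit - 1:
            time.sleep(fast_query_interval_seconds)
    return False
```

A call such as check_target("127.0.0.1", 2211, msg, expected, 2, 2) would use the serverIP, serverPort, fastFailCountLimit and fastQueryIntervalSeconds values from the example configuration file, with msg and expected standing in for the "msgToSocket" and "expectedResponse" parameters.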
If the target check was successful then the failure counter is reset to zero and the loop repeats after a pause of "slowQueryIntervalSeconds" seconds. If the target check was not successful then the failure counter is incremented. The loop continues until this counter reaches the specified "nResponseRestart" value, whereupon SoloManager issues a command to restart the target program and continues with the loop. If the target program restarts then the next check will be successful, so the loop continues normally.
If the restart command does not succeed in restarting the target program then the target checks will continue failing and the failure counter incrementing until it eventually attains the specified "nResponseReboot" counter value. At this point SoloManager issues a command to reboot the host computer and the entire process begins again.
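The escalation logic in the two paragraphs above can be modelled as a small counter-driven function. The sketch below is a simplified reading of that description, not the program's real implementation: the names next_action and simulate are hypothetical, and it assumes that a restart is issued only on the check where the counter first equals nResponseRestart, and that a zero or negative threshold suppresses the corresponding action (as noted in the example configuration file).

```python
def next_action(failure_count, check_succeeded,
                n_response_restart, n_response_reboot):
    """Return (new_failure_count, action); action is None, "restart" or "reboot"."""
    if check_succeeded:
        return 0, None  # a successful check resets the failure counter
    failure_count += 1
    if n_response_reboot > 0 and failure_count >= n_response_reboot:
        return failure_count, "reboot"
    # Assumption: restart fires once, when the counter reaches the threshold.
    if n_response_restart > 0 and failure_count == n_response_restart:
        return failure_count, "restart"
    return failure_count, None


def simulate(check_results, n_response_restart=1, n_response_reboot=3):
    """Feed a sequence of check outcomes through next_action and collect actions."""
    count, actions = 0, []
    for succeeded in check_results:
        count, action = next_action(count, succeeded,
                                    n_response_restart, n_response_reboot)
        actions.append(action)
        if action == "reboot":
            break  # the host reboots and the whole process starts again
    return actions
```

With the example values nResponseRestart = 1 and nResponseReboot = 3, the first failed check triggers a restart, and three consecutive failed checks trigger a reboot.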
During these operations SoloManager writes status information to a log file and optionally can send e-mail to report events. The log file will be located in the directory specified by "outdir". Its size is limited to the last "logFileMsgCapacity" log messages. E-mailed alerts are optional and are enabled by setting "enableEmailing" = true. In this case e-mail messages will be sent to the specified user whenever:
- The SoloManager program starts.
- SoloManager is about to issue a restart command for the target program.
- SoloManager is about to issue a reboot command to the host computer's operating system.
Dependencies
SoloManager requires the following:
- Java version 1.5 or later is available on the host computer.
- It must be able to write to a log file on the filesystem.
- It must be able to issue a system reboot command (command can be defined within the configuration file).
- The operating system may be any of: Linux, Windows (2000, XP, 2003, 2008, Vista, 7), or Mac OS.
Configuration File
The configuration file will almost always need to be modified for the individual application and installation settings. An example file is included below, but a few key settings to modify include:
- executableName Name of the program to run (usually either Solo.exe or Solo_Predictor.exe.)
- startExecutableCommandPre Full path to the program listed as executableName (unless the program's folder has been added to the system path by the installer.)
- outdir Specifies the folder which should contain the log files. By default these will be written to the same folder as the configuration file, but another folder may be preferable if the user does not have read/write permissions to that folder.
- maxTargetRunDurationHours The target will be stopped and restarted every maxTargetRunDurationHours if this is a positive number. It has no effect if it is not a positive number.
- nResponseRestart and nResponseReboot indicate how many target check failures must occur before the application is restarted and/or the system is rebooted (respectively). If the Target Application fails after starting successfully, the failure will be detected by the next normal check; these occur every slowQueryIntervalSeconds seconds. When a target check fails, a restart is invoked after (fastQueryIntervalSeconds+1)*fastFailCountLimit*nResponseRestart seconds. If the restart attempts fail, a system reboot is invoked after (fastQueryIntervalSeconds+1)*fastFailCountLimit*nResponseReboot seconds. Thus the worst-case total elapsed time, in seconds, from the target failing until an action occurs can be roughly calculated by:
ResponseTime = slowQueryIntervalSeconds + (fastQueryIntervalSeconds+1)*fastFailCountLimit*nResponse____
The settings in the configuration file represent likely minimum settings. If longer delays are acceptable before a response, increase the fastQueryIntervalSeconds and/or the nResponse___ settings.
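As a worked example of the ResponseTime formula, the following Python sketch evaluates it for the values in the example configuration file (fastQueryIntervalSeconds = 2, slowQueryIntervalSeconds = 6, fastFailCountLimit = 2, nResponseRestart = 1, nResponseReboot = 3). The function name is hypothetical, and the result is only the rough estimate the formula gives.

```python
def worst_case_response_seconds(slow_query_interval_seconds,
                                fast_query_interval_seconds,
                                fast_fail_count_limit,
                                n_response):
    """Evaluate ResponseTime for one response level (restart or reboot)."""
    return (slow_query_interval_seconds
            + (fast_query_interval_seconds + 1)
            * fast_fail_count_limit
            * n_response)


# Values taken from the example configuration file.
restart_seconds = worst_case_response_seconds(6, 2, 2, 1)  # 6 + 3*2*1 = 12
reboot_seconds = worst_case_response_seconds(6, 2, 2, 3)   # 6 + 3*2*3 = 24
print(restart_seconds, reboot_seconds)
```

With these settings a restart is issued roughly 12 seconds after the target fails, and a reboot roughly 24 seconds after it fails.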
--------------------------------------------------------------------------
------------ start: Example SoloManager.ini configuration file -----------
# default values for the SoloManager
#
# Period to pause when fast and slow polling the executable
fastQueryIntervalSeconds = 2
slowQueryIntervalSeconds = 6
#
# How many times to poll when getting fail result before escalating the response level
fastFailCountLimit = 2
# The initial fastFailCountLimit is usually larger, to allow time for target system startup
startFastFailCountLimit = 15
#
# How many fast cycles should occur with fails before applying response for level 1, 2, etc.
# Note: set to zero or a negative integer to suppress the response action from occurring
#nResponse1
nResponseRestart = 1
# nResponse2
nResponseReboot = 3
#
# maxTargetRunDurationHours. Non-positive value disables this feature.
# Positive value must be greater than 0.05 (hours)
maxTargetRunDurationHours = 0
#
# executable details
executableName = solo_predictor.exe
startExecutableCommandPre = c:\\Progra~1\\EVRI\\Solo_Predictor\\application\\app-bin\\win32\\
startExecutableCommandPost =
stopExecutableCommandPre = taskkill /F /IM \"
stopExecutableCommandPost = \"
#
# reboot
rebootCommandPre =
rebootCommandPost =
rebootCommand = shutdown /?
#
# executable socket details
serverIP = 127.0.0.1
serverPort = 2211
#
# log file capacity
logFileMsgCapacity = 6000
#
# Output directory. DO NOT add surrounding quotes
outdir = .
#
# must be true or false, case insensitive:
enableEmailing = false
#
# mailserver
mailServer = mail.eigenvector.com
mailServerPort = 587
mailUsername = USERNAME@eigenvector.com
mailPassword = PASSWORD
# Note: mail Addresses cannot include spaces and must be well-formed addresses
mailRecepientAddress = SOMEONE@gmail.com
# Use something which will be a valid e-mail address:
mailSenderAddress = monitor@solopredictor.com
#
//---------- start: Example SoloManager.ini configuration file -----------
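To illustrate how the "Pre" and "Post" settings in the example file might combine with executableName, here is a small Python sketch. It rests on two assumptions not confirmed by this page: that the file follows Java-properties-style escaping (so \\ stands for a single backslash and \" for a double quote), and that the start and stop commands are formed by simple concatenation of Pre + executableName + Post. The function names parse_settings and build_command are hypothetical.

```python
import re


def parse_settings(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Simplified unescaping (assumption): a backslash followed by any
        # character yields that character, so \\ -> \ and \" -> ".
        settings[key.strip()] = re.sub(r"\\(.)", r"\1", value.strip())
    return settings


def build_command(settings, action):
    """Assemble the command for action "start" or "stop" (assumed: Pre + name + Post)."""
    pre = settings.get(action + "ExecutableCommandPre", "")
    post = settings.get(action + "ExecutableCommandPost", "")
    return pre + settings["executableName"] + post
```

Under these assumptions, the example file's stop settings would produce the command taskkill /F /IM "solo_predictor.exe", and the start settings would produce the full path to solo_predictor.exe.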
Starting SoloManager Automatically
SoloManager is most useful when run automatically by an operating system. This will start the Target Application in the background. The following describes how to install SoloManager as a service (Windows) or daemon (Linux).
Running SoloManager as a Windows Service
The service folder in the SoloManager main folder contains the tools necessary to run SoloManager as a Windows service. This will automatically start the application without a user logging in. Follow these instructions to install SoloManager as a Windows service:
- Copy the application files onto the computer on which the application is to be run.
- Configure solomanager.ini as needed for the intended behavior.
- Copy solomanager.ini into the "service" folder. This copy of solomanager.ini will be used by the service.
- Run the Install_Service.bat file in the service folder to install the service (this batch file must be run by a user with administrative privileges).
Errors and status messages will be reported to the log files stored in the service/logs folder. To move logs to a different location, edit the service/conf/service.conf file. You can also modify the logging behavior in this file (maximum length, number of log backups, etc.)
On some systems, the service must be executed with the credentials of a specific user in order for the target application to start. Without those credentials, the service will not be able to start the Target Application and the log will reflect this problem. If this happens, edit the conf\service.conf file and locate the section which defines the wrapper.ntservice.account and wrapper.ntservice.password settings. The documentation in the file describes how to modify these settings.
To uninstall the service, run the Uninstall_Service.bat file (as an administrator.)
If you have problems, try running the test script:
Test_Service
to see if the server will start when run manually.
Running SoloManager as a Unix/Linux Daemon
The daemon_linux folder in the SoloManager main folder contains the tools necessary to run SoloManager as a Linux daemon. This will automatically start the application without a user logging in. Follow these instructions to install SoloManager as a Linux Daemon:
- Copy the application files onto the computer on which the application is to be run.
- Configure solomanager.ini as needed for the intended behavior.
- Copy solomanager.ini into the daemon_linux folder. This copy of solomanager.ini will be used by the daemon.
- Run the Install_Daemon script to install the daemon (this script must be run by a user with root privileges).
./Install_Daemon
NOTE: In order to execute this script and have the daemon operate correctly, you may have to manually set the "execute" bit on all files in the top-level daemon_linux folder to "on" using the chmod command inside the daemon_linux folder:
chmod 755 *
Errors and status messages will be reported to the log files stored in the daemon_linux/logs folder. To move logs to a different location, edit the daemon_linux/conf/wrapper.conf file. You can also modify the logging behavior in this file (maximum length, number of log backups, etc.)
To uninstall the daemon, run the Uninstall_Daemon script (as root.)
./Uninstall_Daemon
If you have problems, try running the test script:
./Test_Daemon
to see if the server will start when run manually.