NetCloud Troubleshooting

Unable to log in to the NetCloud machine from the CLI

 

Possible Reasons #1: Machine IP is down
1) If the machine is down: Screen 1
2) If the machine is up: Screen 2

Steps to Diagnose: Ping the machine IP from a terminal using the "ping" command.

Command Used: ping 1.2.3.4

where 1.2.3.4 is the machine IP.

Solution: Contact the machine owner.
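The ping check above can be wrapped in a small helper. This is a minimal sketch: the flags assume a Linux (iputils) ping, and 127.0.0.1 is only a placeholder for the real machine IP.

```shell
#!/bin/sh
# Minimal sketch: report whether a machine answers ping.
check_host() {
  # -c 1: send a single probe; -W 2: wait at most 2 seconds for a reply
  if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then
    echo "$1 is UP"
  else
    echo "$1 is DOWN -- contact the machine owner"
  fi
}

check_host 127.0.0.1   # placeholder; use the real machine IP
```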

 

Unable to start NetCloud test (getting an error)

 

Possible Reasons #1: License error (Screen 3)

Steps to Diagnose: Check whether the license is valid using the command nsu_show_license -l. It will show either:
1) The license file is not present, or
2) The license is invalid/expired.

Command Used: nsu_show_license -l

Solution: Contact the Cavisson client support team for a new license.

 

Possible Reasons #2: Cmon may not be running on the controller and generators (Screen 4)

Steps to Diagnose: Check whether cmon is running using the ps command.

Command Used: ps -ef | grep cmon

Solution: Start/restart cmon using the command:

/etc/init.d/cmon start
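The ps check can be expressed as a tiny filter so the grep process itself is never counted as a match. A sketch only; the sample process line used in testing is invented.

```shell
#!/bin/sh
# Minimal sketch: check a `ps -ef` style listing (read on stdin) for cmon.
cmon_running() {
  # `grep -v grep` drops the grep process itself when reading a live `ps` pipe
  grep cmon | grep -v grep >/dev/null
}

if ps -ef | cmon_running; then
  echo "cmon is running"
else
  echo "cmon is not running -- start it with: /etc/init.d/cmon start"
fi
```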

 

Possible Reasons #3: A generator file is missing on the controller (Screen 5)

Steps to Diagnose: Check file availability in the /home/cavisson/etc/,netcloud directory from the CLI.

Command Used: ls -ltra /home/cavisson/etc/,netcloud

Solution: Add generators from the UI. This option is available at Scenarios >> Add Generator >> Generator File.
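The file check can be scripted as below. This is a sketch: generators.dat is an assumed example file name (the real generator file name may differ), and the directory is the one given above.

```shell
#!/bin/sh
# Minimal sketch: report whether a generator file exists in a directory.
check_gen_file() {
  # $1 = directory, $2 = generator file name
  if [ -f "$1/$2" ]; then
    echo "present"
  else
    echo "missing -- add generators from Scenarios >> Add Generator"
  fi
}

# generators.dat is an assumed example name; the path is the one given above
check_gen_file /home/cavisson/etc/,netcloud generators.dat
```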

 

 

Possible Reasons #4: Generator information is missing in the generator file (Screen 6)

Steps to Diagnose: Check the InitScreen UI. It will show an error stating that generator information is missing.

Solution: Add generators from the UI. This option is available at Scenarios >> Add Generator >> Generator File.

 

Possible Reasons #5: A wrong keyword is used in the scenario (Screen 7)

Steps to Diagnose: Check the InitScreen UI. It will show an error about the wrong/missing keyword.

Command Used: vi scenarioName.conf

Solution: Correct the keyword from the scenario UI.

 

 

Possible Reasons #6: Script is not compiled (Screen 8)

Steps to Diagnose: The InitScreen will show an error regarding the script.

Solution: Correct the script from Script Manager.

 

Possible Reasons #7: PostgreSQL service is not running on the controller (Screen 9)

Steps to Diagnose: Check using the ps command.

Command Used: ps -ef | grep postgresql

Solution: Start the PostgreSQL service.

 

Users went down

 

Possible Reasons #1:

1) Test stopped on a few/all generators.

2) Some NVMs got killed on generators.

(Screen 12)

Steps to Diagnose: Check the logs below:

1) $NS_WDIR/logs/TRXX/partition/ns_logs/ns_trace.log

2) $NS_WDIR/logs/TRXX/TestRunOutput.log

The test may have stopped, or NVMs may have been killed, due to a core dump in a code function or the system kernel. Check dmesg -T for a segfault. Also check for a core file at /home/cavisson/core_files. Take a backtrace using gdb and analyse the frames where the dump was created.

Command Used: vi

1) $NS_WDIR/logs/TRXX/partition/ns_logs/ns_trace.log

2) $NS_WDIR/logs/TRXX/TestRunOutput.log

Solution: Contact the Cavisson product team for a code fix.
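The dmesg step can be scripted as a filter over kernel log text. A sketch only; the gdb invocation in the comment assumes the path of the crashing binary is known.

```shell
#!/bin/sh
# Minimal sketch: print segfault records from kernel log text on stdin.
find_segfaults() {
  # grep exits non-zero when nothing matches; report that case explicitly
  grep -i 'segfault' || echo "no segfault records found"
}

# Live usage:
#   dmesg -T | find_segfaults
# If a core file exists under /home/cavisson/core_files, get the backtrace:
#   gdb /path/to/binary /home/cavisson/core_files/<core_file> -batch -ex bt
```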

 

Generators got discarded

 

Possible Reasons #1: Generator is busy and got killed due to a delay in its progress report (Screen 13)

Steps to Diagnose:

1) Check ns_trace.log, present at $NS_WDIR/logs/TRxx/partition/ns_logs/ns_trace.log.

2) Search for the error message "Did not get progress report for 300000 msecs" in ns_trace.log.

3) Check the netstat logs of all the generators and the controller. The controller netstat log is at $NS_WDIR/logs/TRxx; generator netstat logs are at $NS_WDIR/logs/TRxx/NetCloud/generator_name/TRxx/netstat.txt.

4) From the above logs you can check where data is stuck: in the receive queue or the send queue.

5) In this case all NVMs will be busy, up to 99%.

Command Used: Use the keyword as below: NUM_NVM 2 MACHINE. It will generate a total of 4 NVMs. Note: the value 2 is an example.

Solution: Provide a sufficient number of NVMs in the test to sustain the load.
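Step 2 above can be automated with a simple grep. A sketch, where TRxx must be replaced with the actual test run number.

```shell
#!/bin/sh
# Minimal sketch: count occurrences of the progress-report timeout message.
count_progress_timeouts() {
  # grep -c prints the number of matching lines (0 when there are none)
  grep -c 'Did not get progress report for 300000 msecs' "$1" || true
}

# Live usage (TRxx = actual test run number):
#   count_progress_timeouts "$NS_WDIR/logs/TRxx/partition/ns_logs/ns_trace.log"
```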

 

Possible Reasons #2: Bandwidth is fully utilised (Screen 14)

Steps to Diagnose: Check the Received Throughput graph on the Dashboard. This graph is available at Test Metrics >> HTTPS Request >> TCP Receive Throughput.

Command Used:
1) Check for samples stuck in the send queue using the command netstat -natp.
2) Check bandwidth utilisation using a tool such as nload, iftop, or iperf.

Solution: Reduce the load on that particular generator.

 

Possible Reasons #3: Controller does not send an acknowledgement message to the generator (Screen 15)

Steps to Diagnose:
1) Check controller system health, e.g. load average (see the "Load Average is High" section below).
2) Check in ns_trace.log whether the controller sent the acknowledgement to the generator; the path of this log is mentioned above.

Command Used: top

Solution: Stabilise the controller's health.

 

Possible Reasons #4: Old or bad kernel on the generator machine

Steps to Diagnose: Check the kernel version on the generator using the Linux command uname -r.

Command Used: uname -r

Solution: Upgrade to the latest kernel.

 

Possible Reasons #5: NVMs of a generator are stuck (Screen 16)

Steps to Diagnose:

1) This happens when an NVM gets stuck because the resources it needs are blocked, for example when disk I/O or CPU utilisation on the generator machine is high. The NVM then cannot process, and a delay appears in sample generation.

2) Check the scripts used in the test. There may be a loop in which the NVMs are stuck.

Command Used: vi $NS_WDIR/scripts/project/subProject/script_name

Solution: Correct the script.

 

Getting 100% failure on generators

 

Possible Reasons #1: Generator IPs are not whitelisted at the application end (Screen 26)

Steps to Diagnose: Check the host using the ping command or wget.

Command Used:
1) ping hostname
2) wget hostname

Solution: Get the generator IPs whitelisted at the application end.

Not getting page dump report

Possible Reasons #1: The G_TRACING keyword is not enabled in the scenario

Steps to Diagnose: Check the scenario.

Command Used: Check the KeywordDefination.dat file at $NS_WDIR/etc.

Solution: Use the right keyword in the scenario.

CPU utilization is high

 

Possible Reasons #1: System CPUs are occupied by processes and unavailable to handle other requests (Screen 10)

Steps to Diagnose: Use the top command to check which processes are consuming the most CPU. Go through the link below for more debugging: https://bobcares.com/blog/high-cpu-utilization/

Commands to Validate: top

Solution: Fix the processes that are consuming the most CPU:
1) Fix at the configuration level.
2) Stop the process if it is not needed.

 

Load Average is High

 

Possible Reasons #1: System is overloaded; many processes are waiting for system resources (Screen 11)

Steps to Diagnose: Use the top command to check which processes are taking the most system resources (CPU, RAM, disk, etc.). Go through the link below for more debugging: https://martincarstenbach.wordpress.com/2013/06/25/troubleshooting-high-load-average-on-linux/

Commands to Validate: top

Solution: Fix the processes taking the most system resources.
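As a rule of thumb, a 1-minute load average persistently above the core count means processes are queueing for CPU. A minimal sketch of that comparison (awk does the floating-point comparison the shell cannot):

```shell
#!/bin/sh
# Minimal sketch: compare the 1-minute load average with the CPU count.
load_state() {
  # $1 = 1-minute load average, $2 = number of CPUs
  awk -v l="$1" -v c="$2" 'BEGIN { s = (l + 0 > c + 0) ? "overloaded" : "ok"; print s }'
}

if [ -r /proc/loadavg ]; then
  one_min=$(cut -d' ' -f1 /proc/loadavg)
  cpus=$(nproc)
  echo "load=$one_min cpus=$cpus state=$(load_state "$one_min" "$cpus")"
fi
```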

NetCloud test stuck on database creation

 

Possible Reasons #1: Database is busy processing some other task (Screen 17)

Steps to Diagnose: Check whether any process is running against the database, or whether any uploading or downloading is happening in the DB.

 

Possible Reasons #2 / #3:

1) Sometimes an nsu_db_upload process from an older test run, which is not currently running, is still active.

2) Sometimes older nia_file_aggrigator processes are running.

Steps to Diagnose:

1) Check using ps -ef | grep nsu_db_upload.

2) Check whether any test is running with the corresponding process, using nsu_show_all_netstorm.

Commands to Validate:

1) ps -ef | grep nsu_db_upload

2) nsu_show_all_netstorm

3) kill -9 pid

Solution: Stop these older processes by killing them.
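The diagnosis above can be sketched as a filter that pulls the PIDs of nsu_db_upload processes out of a ps listing. The sample process line in the comment is invented; always confirm via nsu_show_all_netstorm that a test run is inactive before killing its PID.

```shell
#!/bin/sh
# Minimal sketch: print PIDs of nsu_db_upload processes from a `ps -ef`
# style listing read on stdin (the PID is the second column).
stale_upload_pids() {
  grep nsu_db_upload | grep -v grep | awk '{ print $2 }'
}

# Live usage -- kill a PID only after nsu_show_all_netstorm confirms its
# test run is no longer active:
#   ps -ef | stale_upload_pids
#   kill -9 <pid>
```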

 

NetCloud test fails in the middle of the test

 

Possible Reasons #1 / #2:

1) This may happen due to a core dump on the controller, caused by a fault in code or in the system kernel.

2) This may happen due to NVM failures with core dumps on the failed generators.

Steps to Diagnose: Check the logs below:

1) $NS_WDIR/logs/TRXX/partition/ns_logs/ns_trace.log

2) $NS_WDIR/logs/TRXX/TestRunOutput.log

The test may have stopped, or NVMs may have been killed, due to a core dump in a code function or the system kernel. Check dmesg -T for a segfault. Also check for a core file at /home/cavisson/core_files. Take a backtrace using gdb and analyse the frames where the dump was created.

Commands to Validate: gdb

Solution: Contact the Cavisson client support team.

 

Possible Reasons #3: Not enough space left on the controller/generators

Steps to Diagnose:

1) Check using df -h.

2) This can also be checked remotely using the nsu_server_admin command.

Commands to Validate:

1) df -h

2) nsu_server_admin -s ip -c "df -h"

Solution: Free up disk space on the affected machine.
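The df check can be reduced to a filter that flags filesystems above a usage threshold. A sketch that parses `df -h` style output; the column positions assume the standard six-column df layout.

```shell
#!/bin/sh
# Minimal sketch: print mount points whose Use% exceeds a limit.
# Reads `df -h` style output on stdin; default limit is 90%.
full_filesystems() {
  limit="${1:-90}"
  # skip the header; Use% is column 5, mount point is column 6
  awk -v lim="$limit" 'NR > 1 { use = $5; sub(/%/, "", use);
                                if (use + 0 > lim) print $6, $5 }'
}

# Live usage, locally and on a remote generator:
#   df -h | full_filesystems 90
#   nsu_server_admin -s ip -c "df -h"
```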

 

Possible Reasons #4: Someone has stopped the test forcefully (Screen 18)

Steps to Diagnose: Ping the generator IP.

Commands to Validate: ping ip

Solution: Contact the CS team.

 

Possible Reasons #5: Generator went down (Screen 19)

Steps to Diagnose: Check nsu_stop_test.log, present at $NS_WDIR/logs/TRxx.

Commands to Validate: vi $NS_WDIR/logs/TRxx/nsu_stop_test.log

Solution: Restart the test if required.

 

Not able to start test due to shared memory issue

 

Possible Reasons #1: Insufficient shared memory on the NS/generator system (Screen 20)

Steps to Diagnose:

1) Run the command cat /proc/sys/kernel/shmmax.

2) The value must be greater than the buffer requested in the script.

3) On Cavisson cloud machines this value is approximately 20 GB.

Commands to Validate: cat /proc/sys/kernel/shmmax

Solution: Check and, if needed, raise this value; changing it requires root access.
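The shmmax comparison can be scripted as below. A sketch: the 1 GB requirement used in the example is only a placeholder figure; the real requirement is whatever the script's buffers ask for.

```shell
#!/bin/sh
# Minimal sketch: compare kernel shmmax with the bytes a script needs.
shmmax_ok() {
  # $1 = current shmmax (bytes), $2 = bytes required by the script.
  # awk is used because shmmax can exceed the shell's integer range.
  awk -v cur="$1" -v need="$2" 'BEGIN { s = (cur + 0 >= need + 0) ? "OK" : "TOO SMALL"; print s }'
}

if [ -r /proc/sys/kernel/shmmax ]; then
  cur=$(cat /proc/sys/kernel/shmmax)
  echo "shmmax=$cur -> $(shmmax_ok "$cur" 1073741824)"   # 1 GB example requirement
fi
```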

 

Unable to start test due to unknown host error

 

Possible Reasons #1: The DNS nameserver entry is missing from the resolver file (Screen 25)

Steps to Diagnose:

1) Check the file: cat /var/run/dnsmasq/resolv.conf

2) The entry should look like: nameserver 8.8.8.8

3) If the entry is not there, add it manually.

Commands to Validate: cat /var/run/dnsmasq/resolv.conf

Solution: Keep a nameserver entry in the resolv.conf file.
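The resolver check can be scripted (a sketch; the path in the comment follows the one given above):

```shell
#!/bin/sh
# Minimal sketch: verify a resolver file has a nameserver entry.
has_nameserver() {
  if grep -q '^nameserver ' "$1"; then
    echo "nameserver entry present"
  else
    echo "missing -- add a line like: nameserver 8.8.8.8"
  fi
}

# Live usage on the path given above:
#   has_nameserver /var/run/dnsmasq/resolv.conf
```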

 

Possible Reasons #2: Host is not reachable from the source IP

Steps to Diagnose: Check the host using the ping command or wget.

Commands to Validate:

1) ping hostname

2) wget hostname

Solution: Get the source IP whitelisted at the host application.