NetCloud Troubleshooting

Unable to log in to the NetCloud machine from the CLI

 

Possible Reasons #1: Machine IP is down
1) If the machine is down: Screen 1
2) If the machine is up: Screen 2

Steps to Diagnose: Ping the machine IP from a terminal using the "ping" command.

Command Used: ping 1.2.3.4

where 1.2.3.4 is the machine IP.

Solution: Contact the machine owner.
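The ping check above can be wrapped in a small helper. This is a minimal sketch: the flags assume a Linux (iputils) ping, and 127.0.0.1 is only a placeholder for the real machine IP.

```shell
#!/bin/sh
# Minimal sketch: report whether a machine answers ping.
check_host() {
  # -c 1: send a single probe; -W 2: wait at most 2 seconds for a reply
  if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then
    echo "$1 is UP"
  else
    echo "$1 is DOWN -- contact the machine owner"
  fi
}

check_host 127.0.0.1   # placeholder; use the real machine IP
```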

 

Unable to start NetCloud test (getting an error)

 

Possible Reasons #1: License error (Screen 3)

Steps to Diagnose: Check whether the license is valid using the command nsu_show_license -l. It will show either:
1) The license file is not present, or
2) The license is invalid/expired.

Command Used: nsu_show_license -l

Solution: Contact the Cavisson client support team for a new license.

 

Possible Reasons #2: Cmon may not be running on the controller and generators (Screen 4)

Steps to Diagnose: Check whether cmon is running using the ps command.

Command Used: ps -ef | grep cmon

Solution: Start/restart cmon using the command:

/etc/init.d/cmon start
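The ps check can be expressed as a tiny filter so the grep process itself is never counted as a match. A sketch only; the sample process line used in testing is invented.

```shell
#!/bin/sh
# Minimal sketch: check a `ps -ef` style listing (read on stdin) for cmon.
cmon_running() {
  # `grep -v grep` drops the grep process itself when reading a live `ps` pipe
  grep cmon | grep -v grep >/dev/null
}

if ps -ef | cmon_running; then
  echo "cmon is running"
else
  echo "cmon is not running -- start it with: /etc/init.d/cmon start"
fi
```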

 

Possible Reasons #3: A generator file is missing on the controller (Screen 5)

Steps to Diagnose: Check file availability in the /home/cavisson/etc/,netcloud directory from the CLI.

Command Used: ls -ltra /home/cavisson/etc/,netcloud

Solution: Add generators from the UI. This option is available at Scenarios >> Add Generator >> Generator File.
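The file check can be scripted as below. This is a sketch: generators.dat is an assumed example file name (the real generator file name may differ), and the directory is the one given above.

```shell
#!/bin/sh
# Minimal sketch: report whether a generator file exists in a directory.
check_gen_file() {
  # $1 = directory, $2 = generator file name
  if [ -f "$1/$2" ]; then
    echo "present"
  else
    echo "missing -- add generators from Scenarios >> Add Generator"
  fi
}

# generators.dat is an assumed example name; the path is the one given above
check_gen_file /home/cavisson/etc/,netcloud generators.dat
```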

 

 

Possible Reasons #4: Generator information is missing in the generator file (Screen 6)

Steps to Diagnose: Check the InitScreen UI. It will show an error stating that generator information is missing.

Solution: Add generators from the UI. This option is available at Scenarios >> Add Generator >> Generator File.

 

Possible Reasons #5: A wrong keyword is used in the scenario (Screen 7)

Steps to Diagnose: Check the InitScreen UI. It will show an error about the wrong/missing keyword.

Command Used: vi scenarioName.conf

Solution: Correct the keyword from the scenario UI.

 

 

Possible Reasons #6: Script is not compiled (Screen 8)

Steps to Diagnose: The InitScreen will show an error regarding the script.

Solution: Correct the script from Script Manager.

 

Possible Reasons #7: PostgreSQL service is not running on the controller (Screen 9)

Steps to Diagnose: Check using the ps command.

Command Used: ps -ef | grep postgresql

Solution: Start the PostgreSQL service.

 

Users went down

 

Possible Reasons #1:

1) Test stopped on a few/all generators.

2) Some NVMs got killed on generators.

(Screen 12)

Steps to Diagnose: Check the logs below:

1) $NS_WDIR/logs/TRXX/partition/ns_logs/ns_trace.log

2) $NS_WDIR/logs/TRXX/TestRunOutput.log

The test may have stopped, or NVMs may have been killed, due to a core dump in a code function or the system kernel. Check dmesg -T for a segfault. Also check for a core file at /home/cavisson/core_files. Take a backtrace using gdb and analyse the frames where the dump was created.

Command Used: vi

1) $NS_WDIR/logs/TRXX/partition/ns_logs/ns_trace.log

2) $NS_WDIR/logs/TRXX/TestRunOutput.log

Solution: Contact the Cavisson product team for a code fix.
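The dmesg step can be scripted as a filter over kernel log text. A sketch only; the gdb invocation in the comment assumes the path of the crashing binary is known.

```shell
#!/bin/sh
# Minimal sketch: print segfault records from kernel log text on stdin.
find_segfaults() {
  # grep exits non-zero when nothing matches; report that case explicitly
  grep -i 'segfault' || echo "no segfault records found"
}

# Live usage:
#   dmesg -T | find_segfaults
# If a core file exists under /home/cavisson/core_files, get the backtrace:
#   gdb /path/to/binary /home/cavisson/core_files/<core_file> -batch -ex bt
```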

 

Generators got discarded

 

Possible Reasons #1: Generator is busy and got killed due to a delay in its progress report (Screen 13)

Steps to Diagnose:

1) Check ns_trace.log, present at $NS_WDIR/logs/TRxx/partition/ns_logs/ns_trace.log.

2) Search for the error message "Did not get progress report for 300000 msecs" in ns_trace.log.

3) Check the netstat logs of all the generators and the controller. The controller netstat log is at $NS_WDIR/logs/TRxx; generator netstat logs are at $NS_WDIR/logs/TRxx/NetCloud/generator_name/TRxx/netstat.txt.

4) From the above logs you can check where data is stuck: in the receive queue or the send queue.

5) In this case all NVMs will be busy, up to 99%.

Command Used: Use the keyword as below: NUM_NVM 2 MACHINE. It will generate a total of 4 NVMs. Note: the value 2 is an example.

Solution: Provide a sufficient number of NVMs in the test to sustain the load.
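Step 2 above can be automated with a simple grep. A sketch, where TRxx must be replaced with the actual test run number.

```shell
#!/bin/sh
# Minimal sketch: count occurrences of the progress-report timeout message.
count_progress_timeouts() {
  # grep -c prints the number of matching lines (0 when there are none)
  grep -c 'Did not get progress report for 300000 msecs' "$1" || true
}

# Live usage (TRxx = actual test run number):
#   count_progress_timeouts "$NS_WDIR/logs/TRxx/partition/ns_logs/ns_trace.log"
```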

 

Possible Reasons #2: Bandwidth is fully utilised (Screen 14)

Steps to Diagnose: Check the Received Throughput graph on the Dashboard. This graph is available at Test Metrics >> HTTPS Request >> TCP Receive Throughput.

Command Used:
1) Check for samples stuck in the send queue using the command netstat -natp.
2) Check bandwidth utilisation using a tool such as nload, iftop, or iperf.

Solution: Reduce the load on that particular generator.

 

Possible Reasons #3: Controller does not send an acknowledgement message to the generator (Screen 15)

Steps to Diagnose:
1) Check controller system health, e.g. load average (see the "Load Average is High" section below).
2) Check in ns_trace.log whether the controller sent the acknowledgement to the generator; the path of this log is mentioned above.

Command Used: top

Solution: Stabilise the controller's health.

 

Possible Reasons #4: Old or bad kernel on the generator machine

Steps to Diagnose: Check the kernel version on the generator using the Linux command uname -r.

Command Used: uname -r

Solution: Upgrade to the latest kernel.

 

Possible Reasons #5: NVMs of a generator are stuck (Screen 16)

Steps to Diagnose:

1) This happens when an NVM gets stuck because the resources it needs are blocked, for example when disk I/O or CPU utilisation on the generator machine is high. The NVM then cannot process, and a delay appears in sample generation.

2) Check the scripts used in the test. There may be a loop in which the NVMs are stuck.

Command Used: vi $NS_WDIR/scripts/project/subProject/script_name

Solution: Correct the script.

 

Getting 100% failure on generators

 

Possible Reasons #1: Generator IPs are not whitelisted at the application end (Screen 26)

Steps to Diagnose: Check the host using the ping command or wget.

Command Used:
1) ping hostname
2) wget hostname

Solution: Get the generator IPs whitelisted at the application end.

Not getting page dump report

Possible Reasons #1: The G_TRACING keyword is not enabled in the scenario

Steps to Diagnose: Check the scenario.

Command Used: Check the KeywordDefination.dat file at $NS_WDIR/etc.

Solution: Use the right keyword in the scenario.

CPU utilization is high

 

Possible Reasons #1: System CPUs are occupied by processes and unavailable to handle other requests (Screen 10)

Steps to Diagnose: Use the top command to check which processes are consuming the most CPU. Go through the link below for more debugging: https://bobcares.com/blog/high-cpu-utilization/

Commands to Validate: top

Solution: Fix the processes that are consuming the most CPU:
1) Fix at the configuration level.
2) Stop the process if it is not needed.

 

Load Average is High

 

Possible Reasons #1: System is overloaded; many processes are waiting for system resources (Screen 11)

Steps to Diagnose: Use the top command to check which processes are taking the most system resources (CPU, RAM, disk, etc.). Go through the link below for more debugging: https://martincarstenbach.wordpress.com/2013/06/25/troubleshooting-high-load-average-on-linux/

Commands to Validate: top

Solution: Fix the processes taking the most system resources.
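As a rule of thumb, a 1-minute load average persistently above the core count means processes are queueing for CPU. A minimal sketch of that comparison (awk does the floating-point comparison the shell cannot):

```shell
#!/bin/sh
# Minimal sketch: compare the 1-minute load average with the CPU count.
load_state() {
  # $1 = 1-minute load average, $2 = number of CPUs
  awk -v l="$1" -v c="$2" 'BEGIN { s = (l + 0 > c + 0) ? "overloaded" : "ok"; print s }'
}

if [ -r /proc/loadavg ]; then
  one_min=$(cut -d' ' -f1 /proc/loadavg)
  cpus=$(nproc)
  echo "load=$one_min cpus=$cpus state=$(load_state "$one_min" "$cpus")"
fi
```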

NetCloud test stuck on database creation

 

Possible Reasons #1: Database is busy processing some other task (Screen 17)

Steps to Diagnose: Check whether any process is running against the database, or whether any uploading or downloading is happening in the DB.

 

Possible Reasons #2 / #3:

1) Sometimes an nsu_db_upload process from an older test run, which is not currently running, is still active.

2) Sometimes older nia_file_aggrigator processes are running.

Steps to Diagnose:

1) Check using ps -ef | grep nsu_db_upload.

2) Check whether any test is running with the corresponding process, using nsu_show_all_netstorm.

Commands to Validate:

1) ps -ef | grep nsu_db_upload

2) nsu_show_all_netstorm

3) kill -9 pid

Solution: Stop these older processes by killing them.
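The diagnosis above can be sketched as a filter that pulls the PIDs of nsu_db_upload processes out of a ps listing. The sample process line in the comment is invented; always confirm via nsu_show_all_netstorm that a test run is inactive before killing its PID.

```shell
#!/bin/sh
# Minimal sketch: print PIDs of nsu_db_upload processes from a `ps -ef`
# style listing read on stdin (the PID is the second column).
stale_upload_pids() {
  grep nsu_db_upload | grep -v grep | awk '{ print $2 }'
}

# Live usage -- kill a PID only after nsu_show_all_netstorm confirms its
# test run is no longer active:
#   ps -ef | stale_upload_pids
#   kill -9 <pid>
```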

 

NetCloud test fails in the middle of the test

 

Possible Reasons #1 / #2:

1) This may happen due to a core dump on the controller, caused by a fault in code or in the system kernel.

2) This may happen due to NVM failures with core dumps on the failed generators.

Steps to Diagnose: Check the logs below:

1) $NS_WDIR/logs/TRXX/partition/ns_logs/ns_trace.log

2) $NS_WDIR/logs/TRXX/TestRunOutput.log

The test may have stopped, or NVMs may have been killed, due to a core dump in a code function or the system kernel. Check dmesg -T for a segfault. Also check for a core file at /home/cavisson/core_files. Take a backtrace using gdb and analyse the frames where the dump was created.

Commands to Validate: gdb

Solution: Contact the Cavisson client support team.

 

Possible Reasons #3: Not enough space left on the controller/generators

Steps to Diagnose:

1) Check using df -h.

2) This can also be checked remotely using the nsu_server_admin command.

Commands to Validate:

1) df -h

2) nsu_server_admin -s ip -c "df -h"

Solution: Free up disk space on the affected machine.
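The df check can be reduced to a filter that flags filesystems above a usage threshold. A sketch that parses `df -h` style output; the column positions assume the standard six-column df layout.

```shell
#!/bin/sh
# Minimal sketch: print mount points whose Use% exceeds a limit.
# Reads `df -h` style output on stdin; default limit is 90%.
full_filesystems() {
  limit="${1:-90}"
  # skip the header; Use% is column 5, mount point is column 6
  awk -v lim="$limit" 'NR > 1 { use = $5; sub(/%/, "", use);
                                if (use + 0 > lim) print $6, $5 }'
}

# Live usage, locally and on a remote generator:
#   df -h | full_filesystems 90
#   nsu_server_admin -s ip -c "df -h"
```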

 

Possible Reasons #4: Someone has stopped the test forcefully (Screen 18)

Steps to Diagnose: Ping the generator IP.

Commands to Validate: ping ip

Solution: Contact the CS team.

 

Possible Reasons #5: Generator went down (Screen 19)

Steps to Diagnose: Check nsu_stop_test.log, present at $NS_WDIR/logs/TRxx.

Commands to Validate: vi $NS_WDIR/logs/TRxx/nsu_stop_test.log

Solution: Restart the test if required.

 

Not able to start test due to shared memory issue

 

Possible Reasons #1: Insufficient shared memory on the NS/generator system (Screen 20)

Steps to Diagnose:

1) Run the command cat /proc/sys/kernel/shmmax.

2) The value must be greater than the buffer requested in the script.

3) On Cavisson cloud machines this value is approximately 20 GB.

Commands to Validate: cat /proc/sys/kernel/shmmax

Solution: Check and, if needed, raise this value; changing it requires root access.
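The shmmax comparison can be scripted as below. A sketch: the 1 GB requirement used in the example is only a placeholder figure; the real requirement is whatever the script's buffers ask for.

```shell
#!/bin/sh
# Minimal sketch: compare kernel shmmax with the bytes a script needs.
shmmax_ok() {
  # $1 = current shmmax (bytes), $2 = bytes required by the script.
  # awk is used because shmmax can exceed the shell's integer range.
  awk -v cur="$1" -v need="$2" 'BEGIN { s = (cur + 0 >= need + 0) ? "OK" : "TOO SMALL"; print s }'
}

if [ -r /proc/sys/kernel/shmmax ]; then
  cur=$(cat /proc/sys/kernel/shmmax)
  echo "shmmax=$cur -> $(shmmax_ok "$cur" 1073741824)"   # 1 GB example requirement
fi
```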

 

Unable to start test due to unknown host error

 

Possible Reasons #1: The DNS nameserver entry is missing from the resolver file (Screen 25)

Steps to Diagnose:

1) Check the file: cat /var/run/dnsmasq/resolv.conf

2) The entry should look like: nameserver 8.8.8.8

3) If the entry is not there, add it manually.

Commands to Validate: cat /var/run/dnsmasq/resolv.conf

Solution: Keep a nameserver entry in the resolv.conf file.
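The resolver check can be scripted (a sketch; the path in the comment follows the one given above):

```shell
#!/bin/sh
# Minimal sketch: verify a resolver file has a nameserver entry.
has_nameserver() {
  if grep -q '^nameserver ' "$1"; then
    echo "nameserver entry present"
  else
    echo "missing -- add a line like: nameserver 8.8.8.8"
  fi
}

# Live usage on the path given above:
#   has_nameserver /var/run/dnsmasq/resolv.conf
```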

 

Possible Reasons #2: Host is not reachable from the source IP

Steps to Diagnose: Check the host using the ping command or wget.

Commands to Validate:

1) ping hostname

2) wget hostname

Solution: Get the source IP whitelisted at the host application.