Troubleshooting Guide
Setting up a Compute Resource Node can be a daunting task. This page is here to help you troubleshoot the most common issues.
- Back up configuration files before making changes.
- Monitor the node after each troubleshooting step to check for resolution.
- Document each step taken for future reference or for support if needed.
- If you cannot resolve the issue, check the latest issues on the Discourse Forum for support.
1) 404: Invalid message reference
Issue Summary
After setting up a CRN, users may encounter a "404: Invalid message reference" error when attempting to connect to the node's diagnostic page.
Probable Cause
- SSL configuration may be incomplete or incorrect, despite successful SSL activation.
- The hostname might not be correctly configured in the aleph-vm settings.
Troubleshooting Steps
- Recheck SSL Configuration:
  - Confirm that SSL certificates are correctly installed and configured.
  - Review the SSL configuration in the web server (e.g., Caddy, Nginx) to ensure it points to the intended ports with the right certificate paths.
- Configure Hostname Correctly:
  - Ensure the hostname is properly configured as per the CRN installation guide.
  - Make sure the domain name in the supervisor.env file matches the domain used in your SSL configuration.
- Restart Services:
  - After updating the hostname, restart the relevant services to apply the changes.
  - This may include restarting the Docker container and the web server service.
- Review Log Files:
  - If the problem persists, check the log files of both the Docker container and the web server for any specific error messages related to SSL or hostname configurations.
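The hostname check above can be sketched in a few lines of Python. This is a minimal illustration, not part of aleph-vm: it assumes the domain is stored under a key named ALEPH_VM_DOMAIN_NAME in supervisor.env (an assumption; adjust the key to whatever your installation guide specifies), and the helper names parse_env and domain_matches are hypothetical.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def domain_matches(env_text: str, expected_domain: str,
                   key: str = "ALEPH_VM_DOMAIN_NAME") -> bool:
    """Return True when the configured domain equals the expected one.

    The default key name is an assumption, not taken from this guide.
    The comparison ignores case and a trailing dot.
    """
    configured = parse_env(env_text).get(key, "")
    return configured.lower().rstrip(".") == expected_domain.lower().rstrip(".")
```

To use it, read the contents of /etc/aleph-vm/supervisor.env and pass them in together with the domain named in your web server's SSL configuration. A False result suggests the two are out of sync.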
2) SQUASHFS Errors in Diagnostic VM
Issue Summary
Users may encounter SQUASHFS errors indicating a failure to decompress data, suggesting possible corruption of the runtime diagnostic VM.
Symptoms
Repeated SQUASHFS errors in the logs, such as "Failed to read block", "Unable to read data cache entry" and "zlib decompression failed, data probably corrupt", related to a specific block.
Probable Cause
The runtime of the new diagnostic VM appears to be improperly downloaded or corrupted.
Troubleshooting Steps
- Stop the Supervisor: It is important to stop the VMs before performing the operations below.
- Clear Cache: Remove the cached copy of the problematic runtime file, identified by the diagnostic VM hash. The file is located at /var/cache/aleph/vm/runtime/$RUNTIME_HASH.
  - Navigate to the cache directory: cd /var/cache/aleph/vm/runtime/
  - Locate the file named after the corresponding $RUNTIME_HASH.
  - Remove the file.
- Restart Supervisor: After deleting the problematic file, restart the supervisor. This should trigger the re-download of the runtime file.
  - Restart the supervisor: sudo systemctl restart supervisor (or aleph-vm-supervisor.service when installing from source).
- Re-download: Upon restart, the system will automatically attempt to re-download the runtime, replacing the corrupted file.
  - If the problem persists, further investigation into network stability or hardware integrity may be necessary.
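The cache-clearing step can be sketched as a small Python helper. It is a hedged illustration only: the directory is the one used in the cd command above, the function names are hypothetical, and the hexadecimal check assumes runtime hashes look like the 64-character identifiers quoted elsewhere in this guide. The helper defaults to a dry run so nothing is deleted by accident.

```python
from pathlib import Path

# Cache directory used in the "cd" step above.
RUNTIME_CACHE_DIR = Path("/var/cache/aleph/vm/runtime")


def cached_runtime_path(runtime_hash: str,
                        cache_dir: Path = RUNTIME_CACHE_DIR) -> Path:
    """Return where the cached runtime is expected, rejecting odd input."""
    allowed = "0123456789abcdef"
    if not runtime_hash or any(c not in allowed for c in runtime_hash.lower()):
        # Assumption: runtime hashes are plain hexadecimal strings.
        raise ValueError(f"not a hexadecimal hash: {runtime_hash!r}")
    return cache_dir / runtime_hash


def remove_cached_runtime(runtime_hash: str,
                          cache_dir: Path = RUNTIME_CACHE_DIR,
                          dry_run: bool = True) -> bool:
    """Return True if the cached file exists; delete it unless dry_run."""
    target = cached_runtime_path(runtime_hash, cache_dir)
    if not target.is_file():
        return False
    if not dry_run:
        target.unlink()
    return True
```

Run it first with the default dry_run=True to confirm the file is found, then again with dry_run=False once the supervisor is stopped. Deleting from the real cache directory normally requires root privileges.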
3) Missing Diagnostic VM Metrics
Issue Summary
The diagnostic_vm_latency metrics data is missing for your CRN, even though virtualization is reportedly operational.
Users can check the raw network metrics data for their node on the Message Explorer.
For more info on the data found there, see Metrics.
Two URLs are used to check this metric:
- /vm/67705389842a0a1b95eaa408b009741027964edc805997475e95c505d642edd8 (legacy runtime)
- /vm/3fc0aa9569da840c43e7bd2033c3c580abb46b007527d6d20f2d4e98e867f7af (current runtime)
Check that both work on your node, at a URL similar to https://my-compute-node.example/vm/3fc0aa9569da840c43e7bd2033c3c580abb46b007527d6d20f2d4e98e867f7af
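Checking both URLs by hand is easy to get wrong with hashes this long, so a short Python sketch may help. It only uses the standard library. The function names are hypothetical, and treating an HTTP 200 response as "works" is an assumption about what a healthy diagnostic VM returns.

```python
import urllib.request

# The two diagnostic VM hashes listed above.
DIAGNOSTIC_VM_HASHES = {
    "legacy runtime": "67705389842a0a1b95eaa408b009741027964edc805997475e95c505d642edd8",
    "current runtime": "3fc0aa9569da840c43e7bd2033c3c580abb46b007527d6d20f2d4e98e867f7af",
}


def diagnostic_urls(node_url: str) -> dict:
    """Build the /vm/<hash> URL of each diagnostic VM for a given node."""
    base = node_url.rstrip("/")
    return {label: f"{base}/vm/{vm_hash}"
            for label, vm_hash in DIAGNOSTIC_VM_HASHES.items()}


def check_diagnostic_urls(node_url: str, timeout: float = 30.0) -> dict:
    """Request each URL; map label -> True when the response status is 200."""
    results = {}
    for label, url in diagnostic_urls(node_url).items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                results[label] = response.status == 200
        except OSError:
            # URLError, HTTPError and socket timeouts all derive from OSError.
            results[label] = False
    return results
```

For example, check_diagnostic_urls("https://my-compute-node.example") returns one boolean per runtime. A False entry points at the runtime to investigate first.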
Symptoms
- No diagnostic_vm_latency entry in the node's diagnostic data.
- Node appears functional, and virtualization is reportedly operational.
- Previous cache clearing solution was ineffective.
Troubleshooting Steps
- Upgrade Node Software:
  - Ensure the node is running the latest CRN version.
- Disable IPv6 Forwarding:
  - If upgrading does not resolve the issue, try disabling IPv6 forwarding:
    - Set ALEPH_VM_IPV6_FORWARDING_ENABLED=False in /etc/aleph-vm/supervisor.env.
    - Manually check whether IPv6 forwarding is still active on the host. If the reported value is 1, disable it.
- Clear Cache:
  - Clear the runtime cache again, as described in section 2.
- Contact Cloud Provider:
  - If the issue persists, ask your Cloud Provider: "I tried to enable IPv6 forwarding on my server. This makes my machine unreachable over IPv6. Why is that?"
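The two IPv6 forwarding checks can be sketched in Python as follows. On Linux, the kernel exposes the sysctl key net.ipv6.conf.all.forwarding as the file /proc/sys/net/ipv6/conf/all/forwarding, so reading that file is equivalent to running sysctl net.ipv6.conf.all.forwarding. The function names are hypothetical, and the sketch only recognises the literal value False in the env file, as written in the step above.

```python
from pathlib import Path

# Kernel view of the sysctl key net.ipv6.conf.all.forwarding (Linux only).
FORWARDING_FILE = Path("/proc/sys/net/ipv6/conf/all/forwarding")


def ipv6_forwarding_enabled(path: Path = FORWARDING_FILE) -> bool:
    """Return True when the kernel reports forwarding as 1."""
    return path.read_text().strip() == "1"


def env_disables_forwarding(env_text: str) -> bool:
    """Return True when supervisor.env sets the forwarding flag to False."""
    for raw in env_text.splitlines():
        key, sep, value = raw.strip().partition("=")
        if sep and key.strip() == "ALEPH_VM_IPV6_FORWARDING_ENABLED":
            return value.strip().lower() == "false"
    # The variable is absent or commented out.
    return False
```

If ipv6_forwarding_enabled() still returns True after the env change and a supervisor restart, the value can be turned off at the kernel level with sudo sysctl -w net.ipv6.conf.all.forwarding=0. That change does not survive a reboot unless it is also written to a sysctl configuration file.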
4) IPv6 Unreachable
Issue Summary
When using IPv6 on a node, the network is unreachable.
Symptoms
- The ping6 command fails to connect to an IPv6 address.
- The system returns the error "Network is unreachable."
Common Causes
- Incorrect IPv6 configuration.
- Network interface not configured for IPv6.
- IPv6 connectivity issues with the network.
Troubleshooting Steps
- Check IPv6 Configuration:
  - Ensure that IPv6 is enabled on the network interface.
  - Verify that the IPv6 address is correctly assigned to the interface.
  - Confirm that the gateway for IPv6 is set up correctly.
- Review Netplan Configuration (for Ubuntu systems):
  - Open the Netplan configuration file, typically located at /etc/netplan/*.yaml.
  - Check for proper syntax and settings for IPv6, including address, gateway, and nameservers.
  - Example of a Netplan configuration for IPv6:

        network:
          version: 2
          ethernets:
            eth0:
              dhcp4: no
              dhcp6: no
              addresses:
                - "2602:2940:0:1f::2/64"
              gateway6: "2602:2940:0:1f::1"
              nameservers:
                addresses: ["2001:4860:4860::8888", "2001:4860:4860::8844"]

  - After making changes, apply them with sudo netplan apply.
- Check Network Interface:
  - Use ip -6 addr show to check if the IPv6 address is assigned to the network interface.
  - Use ip -6 route show to verify the default route for IPv6.
- Test Network Connectivity:
  - Use ping6 to ping the local IPv6 gateway or known IPv6 addresses like Google's DNS 2001:4860:4860::8888 to test connectivity.
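A frequent cause of "Network is unreachable" is a gateway that does not sit on the same link as the interface address. The following sketch uses Python's standard ipaddress module to sanity-check an address and gateway pair before it goes into Netplan. The function name is hypothetical, and accepting any link-local gateway (fe80::/10) as valid is a simplifying assumption, since such gateways are reachable on the link regardless of the global prefix.

```python
import ipaddress


def gateway_on_link(interface_cidr: str, gateway: str) -> bool:
    """Return True when the gateway is plausibly reachable from the interface.

    interface_cidr is an address with prefix length, e.g. "2001:db8::2/64".
    """
    iface = ipaddress.ip_interface(interface_cidr)
    gw = ipaddress.ip_address(gateway)
    if gw.version != iface.version:
        # An IPv4 gateway cannot route an IPv6 address, and vice versa.
        return False
    if gw.is_link_local:
        # Link-local gateways are on-link whatever the global prefix is.
        return True
    # Otherwise the gateway must share the subnet and differ from the host.
    return gw in iface.network and gw != iface.ip
```

With the values from the Netplan example, gateway_on_link("2602:2940:0:1f::2/64", "2602:2940:0:1f::1") passes, while a gateway in a different /64 would be rejected.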
5) Persistent Storage Corruption
Issue Summary
A Compute Resource Node exhibits issues with the persistent_storage feature.
Symptoms
- Errors in the persistent_storage field from the diagnostic on the index page of a CRN or on the /status/check/fastapi API endpoint.
- The endpoint /state/increment on the diagnostic VM returns an error 500.
- The field diagnostic_vm_latency is missing from the metrics.
Probable Cause
The diagnostic VM tests the capability of the VM to persist data on the host. This is done by incrementing a counter in a JSON file, itself stored in a persistent volume.
When a diagnostic virtual machine happens to be stopped while writing data to this file, it is possible to end up with a corrupt file that, for example, only contains part of the expected JSON data and cannot be parsed.
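The failure mode described above, a JSON file cut off mid-write, is easy to detect because a truncated document no longer parses. The sketch below is a minimal illustration with a hypothetical function name. The actual name and location of the counter file inside the persistent volume are not given in this guide, so the path is left as a parameter.

```python
import json
from pathlib import Path


def is_corrupt_json(path: Path) -> bool:
    """Return True when the file cannot be parsed as JSON.

    An empty or partially written file counts as corrupt.
    """
    try:
        json.loads(path.read_text(encoding="utf-8"))
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError both derive from ValueError.
        return True
    return False
```

For instance, a file containing only {"coun is reported as corrupt, while a complete object such as {"counter": 41} is not.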
Troubleshooting Steps
- Identify Corrupted Volumes:
  - Identify the identifiers of the two diagnostic VMs from the variables CHECK_FASTAPI_VM_ID and LEGACY_CHECK_FASTAPI_VM_ID in the configuration of aleph-vm.
- Stop the Service:
  - Stop the service to avoid any further corruption.
- Remove Corrupted Volumes:
  - Remove the corrupted persistent volume files of the two diagnostic VMs identified above.
- Restart Services:
  - After removing the corrupted volume files, restart the affected services to trigger the recreation of the necessary storage files.
- Verify System Stability:
  - Check the dashboard on the index page of the CRN, or open the storage test endpoint /state/increment on both diagnostic VMs.