Limitations and Considerations

The FMJ Systems Reference Implementation was designed and built within the constraints of a capstone project: limited hardware, unknown internet connectivity for the demonstration, and a compressed timeline. Several things would therefore need to change if this architecture were implemented in a production setting. In this document, I describe those differences.

Simulated Internet

The first and most obvious change is the removal of the simulated internet in favor of a public authoritative DNS provider and a publicly trusted certificate authority. I included the simulated internet network in the reference implementation because I was unsure what kind of network connectivity I would have for the IT Expo demonstration.

API Automation

This is probably the biggest consideration of all. API calls to Proxmox would make it possible to automate the setup of tenant resources. Template VMs and CTs could be created and then cloned into new environments. Currently, tenant onboarding is a manual process; API automation would make it much more efficient and reduce the possibility of human error.
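As a minimal sketch of what that automation could look like, the snippet below builds (but does not send) the request for cloning a template VM into a tenant's resource pool. It assumes the Proxmox clone endpoint `POST /nodes/{node}/qemu/{vmid}/clone` with its `newid`, `name`, `pool`, and `full` parameters. The node name, VMIDs, naming convention, and the `build_clone_request` helper are hypothetical and not part of the reference implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloneRequest:
    """Description of an HTTP request; this sketch never sends it."""
    method: str
    path: str
    data: dict


def build_clone_request(node: str, template_vmid: int, new_vmid: int,
                        tenant: str, role: str) -> CloneRequest:
    """Describe a full clone of a template VM into a tenant's resource pool."""
    if new_vmid == template_vmid:
        raise ValueError("new VMID must differ from the template VMID")
    return CloneRequest(
        method="POST",
        path=f"/api2/json/nodes/{node}/qemu/{template_vmid}/clone",
        data={
            "newid": new_vmid,
            "name": f"{tenant}-{role}",  # hypothetical naming convention
            "pool": tenant,              # assumes the pool is named after the tenant
            "full": 1,                   # full clone rather than a linked clone
        },
    )


if __name__ == "__main__":
    # Hypothetical values: node "pve1", template 9000, new VM 1101.
    req = build_clone_request("pve1", 9000, 1101, "tenant01", "dc01")
    print(req.method, req.path, req.data)
```

An onboarding script could generate one such request per template (for example, a domain controller and a file server) and submit them with any HTTP client authenticated by a Proxmox API token.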

With some web development, it would be possible to build a self-service portal that lets tenants provision resources within set limits. Unfortunately, Proxmox does not offer resource allocation limits on resource pools, but a custom self-service portal could enforce them. In this reference implementation, billing uses a fixed monthly cost. The Proxmox REST API exposes CPU, RAM, and network utilization on a per-VM basis. Polling the API for per-VM resource usage and aggregating the data by tenant resource pool would make it possible to track usage for billing purposes.
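The aggregation step could be sketched as follows. The input is assumed to be the list returned by a call such as `GET /cluster/resources?type=vm`, where each entry is taken to carry `pool`, `cpu` (a fraction of `maxcpu`), `maxcpu`, `mem`, `netin`, and `netout` fields. Those field names and the sample values are assumptions made for illustration, and `aggregate_by_pool` is a hypothetical helper.

```python
from collections import defaultdict


def aggregate_by_pool(resources):
    """Sum per-VM usage figures into per-pool (per-tenant) totals."""
    totals = defaultdict(lambda: {
        "vms": 0, "cpu_cores_used": 0.0, "mem_bytes": 0, "net_bytes": 0,
    })
    for item in resources:
        pool = item.get("pool")
        if not pool:
            continue  # skip VMs that are not assigned to any pool
        entry = totals[pool]
        entry["vms"] += 1
        # cpu is assumed to be a 0..1 fraction of the VM's maxcpu cores
        entry["cpu_cores_used"] += item.get("cpu", 0.0) * item.get("maxcpu", 0)
        entry["mem_bytes"] += item.get("mem", 0)
        entry["net_bytes"] += item.get("netin", 0) + item.get("netout", 0)
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample data, not measurements from the reference cluster.
    sample = [
        {"vmid": 1101, "pool": "tenant01", "cpu": 0.5, "maxcpu": 2,
         "mem": 1024, "netin": 100, "netout": 50},
        {"vmid": 1102, "pool": "tenant01", "cpu": 0.25, "maxcpu": 4,
         "mem": 2048, "netin": 10, "netout": 5},
        {"vmid": 900, "cpu": 0.1, "maxcpu": 1, "mem": 512},
    ]
    for pool, usage in aggregate_by_pool(sample).items():
        print(pool, usage)
```

If the network figures are cumulative counters, as assumed here, a billing job would need to store each poll and bill on the difference between consecutive polls rather than on the raw values.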

Storage and Backup Considerations

Because of the limitations of the hardware I have, I was not able to implement ideal storage and backup plans. Currently, each node in the cluster contains an SSD that runs all tenant workloads, divided into ZFS datasets, one per tenant. Each cluster node that either hosts or acts as standby for a tenant's resources is allocated a dataset for that tenant, which is then replicated across those nodes to provide high availability. Ideally, the cluster would use Ceph storage clustering, which would provide redundant, scalable storage across the tenant cluster.

I do not have backups implemented because of a lack of resources, but I do have a plan for how I would implement them for this architecture. It would introduce a backup and storage cluster dedicated to hosting backups of tenant and provider workloads. The backup plan would include a daily backup stored locally on the cluster, a weekly backup stored on the backup and storage cluster with the last 3 backups preserved, and a cross-site backup performed bi-weekly. If another site does not exist to back up to, a cloud provider could host the off-site backups, which would be encrypted prior to upload to ensure security and confidentiality.
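The schedule above can be expressed as a small sketch. It assumes "bi-weekly" means every two weeks, that weekly and cross-site backups run on Sundays, and that the schedule starts on a hypothetical anchor date. The tier names and both helper functions are illustrative only.

```python
from datetime import date

# Hypothetical first Sunday of the schedule; any Sunday would work.
ANCHOR = date(2024, 1, 7)


def backup_tiers(day: date) -> list[str]:
    """Return the backup tiers that should run on the given day."""
    tiers = ["daily-local"]                    # every day, kept on the cluster
    days_since_anchor = (day - ANCHOR).days
    if days_since_anchor % 7 == 0:
        tiers.append("weekly-backup-cluster")  # every Sunday
    if days_since_anchor % 14 == 0:
        tiers.append("biweekly-cross-site")    # every second Sunday
    return tiers


def prune_weekly(backups: list[date], keep: int = 3) -> list[date]:
    """Return the weekly backups to retain: the newest `keep` of them."""
    return sorted(backups, reverse=True)[:keep]


if __name__ == "__main__":
    for d in (date(2024, 1, 7), date(2024, 1, 8), date(2024, 1, 14)):
        print(d, backup_tiers(d))
```

Keeping the schedule as a pure function of the date makes it easy to test independently of whichever backup tool ends up performing the jobs.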

Tenant Network Setup Considerations

Currently, the implementation is infrastructure heavy: each tenant requires switch VLAN configuration, router subinterface configuration, and Proxmox SDN configuration. I decided to implement it this way to demonstrate Cisco device configuration skills, but using Proxmox Simple SDN zones instead of VLAN SDN zones (which require VLAN configuration on the networking infrastructure) would allow a much more streamlined tenant setup. Simple SDN zones remove the need for router and switch configuration during tenant onboarding, and the remaining configuration could be completed through API calls.
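A rough sketch of those API calls is shown below. It only builds the request list and sends nothing. The endpoints (`/cluster/sdn/zones`, `/cluster/sdn/vnets`, `/cluster/sdn/vnets/{vnet}/subnets`, and a final `PUT /cluster/sdn` to apply pending changes) reflect my understanding of the Proxmox SDN API and should be checked against the API viewer for the installed version. The zone and VNet naming scheme, the example subnet, and the `simple_zone_requests` helper are hypothetical.

```python
import ipaddress


def simple_zone_requests(tenant_id: int, cidr: str, gateway: str):
    """Build the ordered API requests for one tenant's Simple zone."""
    network = ipaddress.ip_network(cidr)
    if ipaddress.ip_address(gateway) not in network:
        raise ValueError(f"gateway {gateway} is not inside {cidr}")
    # IDs are kept short and alphanumeric on the assumption that
    # Proxmox restricts the length of SDN identifiers.
    zone = f"t{tenant_id:03d}z"
    vnet = f"t{tenant_id:03d}v"
    base = "/api2/json/cluster/sdn"
    return [
        ("POST", f"{base}/zones", {"zone": zone, "type": "simple"}),
        ("POST", f"{base}/vnets", {"vnet": vnet, "zone": zone}),
        ("POST", f"{base}/vnets/{vnet}/subnets",
         {"subnet": str(network), "type": "subnet", "gateway": gateway}),
        ("PUT", base, {}),  # apply the pending SDN configuration
    ]


if __name__ == "__main__":
    # Hypothetical tenant number and addressing.
    for method, path, data in simple_zone_requests(7, "10.7.0.0/24", "10.7.0.1"):
        print(method, path, data)
```

Because every step is an API request, the same onboarding script that clones tenant VMs could also create the tenant network, with no switch or router changes involved.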

Scalability Considerations

Currently, the architecture can only support a maximum of 255 tenants, and the addressing scheme is quite wasteful with IPv4 addresses. A better implementation would segregate tenant networks using a router VM such as pfSense, or different SDN zones as mentioned in the previous section. This would streamline tenant onboarding, as a standardized template could be cloned each time a new tenant is added.
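To illustrate how the per-tenant prefix size drives the tenant limit, the sketch below uses Python's `ipaddress` module. The reference addressing plan is not reproduced here, so the `10.0.0.0/16` supernet and the prefix lengths are purely hypothetical, as are the two helper functions.

```python
import ipaddress
from itertools import islice


def tenant_capacity(supernet: str, tenant_prefix: int) -> int:
    """Number of tenant subnets of the given prefix length in a supernet."""
    net = ipaddress.ip_network(supernet)
    if tenant_prefix < net.prefixlen:
        raise ValueError("tenant prefix must not be shorter than the supernet prefix")
    return 2 ** (tenant_prefix - net.prefixlen)


def tenant_subnet(supernet: str, tenant_prefix: int, index: int):
    """Return the index-th (0-based) tenant subnet carved from the supernet."""
    if not 0 <= index < tenant_capacity(supernet, tenant_prefix):
        raise IndexError("tenant index is outside the supernet")
    subnets = ipaddress.ip_network(supernet).subnets(new_prefix=tenant_prefix)
    return next(islice(subnets, index, None))


if __name__ == "__main__":
    # Hypothetical supernet: compare /24 and /27 tenant networks.
    for prefix in (24, 27):
        print(prefix, tenant_capacity("10.0.0.0/16", prefix))
    print(tenant_subnet("10.0.0.0/16", 27, 1))
```

With a per-tenant router VM or isolated SDN zones, tenant networks no longer have to be unique across the cluster, so a sizing calculation like this matters mainly for whatever address space is still shared.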

AD Integration Considerations

In the reference implementation, each tenant is provided with an Active Directory environment. This integration allows tenants to log in to the Proxmox web interface to manage their workloads. RBAC is applied to the tenant's imported groups and users to isolate their access to resources. For clients that already have an AD environment, the existing AD infrastructure could be integrated instead, as could other identity/SSO solutions that are compatible with OpenID Connect (OIDC). In place of an AD DC, a regular DNS server for the tenant's internal services could be deployed under the tenant's control.

DNS Considerations

The DNS hierarchy in the reference implementation consists of 3 tiers: the simulated internet DNS acting as an authoritative name server, the shared services DNS server acting as a middle layer, and the tenant AD DC handling internal DNS resolution. Tenant resources are configured to use their respective AD DC as their DNS server, which forwards unknown requests to the shared services DNS, which in turn forwards them to the simulated internet DNS server.

The shared services DNS server is fully managed by FMJ Systems, and tenants have no administrative access to it. Its role is to provide logging for FMJ Systems. In the reference implementation, the shared services DNS contains records for public-facing tenant resources that point to the local address of the proxy manager. This is because I am not able to implement NAT hairpinning due to a lack of NVI support on Cisco IOS-XE, which is required for that functionality. Without it, connections originating from an inside NAT address and trying to reach the outside NAT address would fail.
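The forwarding chain and the local override described above can be modeled with a toy resolver. This is only an illustration of the lookup order, not how the DNS servers themselves are configured. All hostnames and addresses are hypothetical, and the `DnsTier` class is invented for this sketch.

```python
class DnsTier:
    """One DNS tier: answers from its own records, otherwise forwards upstream."""

    def __init__(self, name, records, forwarder=None):
        self.name = name
        self.records = records
        self.forwarder = forwarder

    def resolve(self, qname):
        """Return (address or None, list of tier names that were queried)."""
        if qname in self.records:
            return self.records[qname], [self.name]
        if self.forwarder is None:
            return None, [self.name]
        answer, path = self.forwarder.resolve(qname)
        return answer, [self.name] + path


if __name__ == "__main__":
    # Hypothetical records; documentation and private address ranges only.
    internet = DnsTier("simulated-internet", {
        "www.tenant01.example": "203.0.113.10",   # outside NAT address
        "www.other.example": "203.0.113.50",
    })
    shared = DnsTier("shared-services", {
        "www.tenant01.example": "10.0.0.5",       # local proxy manager address
    }, forwarder=internet)
    tenant_dc = DnsTier("tenant-ad-dc", {
        "dc01.corp.tenant01.example": "10.1.0.10",
    }, forwarder=shared)

    for name in ("dc01.corp.tenant01.example",
                 "www.tenant01.example",
                 "www.other.example"):
        print(name, tenant_dc.resolve(name))
```

In this model, a query for the tenant's public-facing name stops at the shared services tier and returns the proxy manager's local address, which is what lets internal clients avoid the NAT hairpin.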

Bind9 was chosen as the DNS server for both the simulated internet and shared services because it is widely regarded as the industry standard for authoritative DNS and offers the zone management capabilities required for the simulated internet DNS. AdGuard was considered for the shared services role, but Bind9 was selected to keep the DNS infrastructure consistent across both servers. PowerDNS was identified as a better fit for the shared services logging role but was discovered too late to implement. In a production deployment, PowerDNS paired with a SIEM or centralized log aggregation platform would replace Bind9 in the shared services tier.