Computing Infrastructure — Appunti TiTilda

Indice

Data Centers

As internet adoption grew, computing shifted from local machines to centralized data centers, large facilities housing thousands of servers and providing computational power and storage for various applications.

Data centers are geographically distributed in areas with favorable cooling and power conditions, reducing user latency and improving fault tolerance.

Benefits of Centralized Computing

User Benefits:

Vendor Benefits:

Infrastructure Benefits:

Warehouse-Scale Computing

Warehouse-scale computing (WSC) is a type of data center architecture that treats thousands of interconnected servers as a single unified system.

This enables running large-scale applications (search engines, social media platforms, online gaming services) that require significant computational resources to be efficiently managed and scaled.

Many such providers also offer cloud services, virtualizing their infrastructure for external customers, allowing a traditional data center to be built on top of a warehouse-scale computing infrastructure.

Geographic Distribution

Global cloud infrastructure is hierarchically organized for redundancy and low latency:

Physical Architecture

Data center architecture is similar to that of personal computers but at a massive scale.

Computing Components:

Support Infrastructure:

Server

Servers are fundamental computing units in data centers, designed for performance, reliability, and scalability.

Form Factors

Servers are made in different standard form factors such us:

Components

All components are standardized for quick replacement and maintenance, with hot-swappable parts to minimize downtime.

Storage

WIth time the data have been moved from local towards cloud providers. This is due:

File System Abstractions

OS manages data through hierarchical abstractions:

During data deletion, the cluster is only flagged as deleted, allowing it to be overwritten.

Space Allocation

The storage unit is represented as a multiple of the cluster size:

\text{Disk Size} = \lceil \frac{\text{File Size}}{\text{Cluster Size}} \rceil \times \text{Cluster Size}

when a file is smaller than the cluster size, sime of its space is wasted leading to internal fragmentation:

\text{Wasted Space} = \text{Disk Size} - \text{File Size} When a file’s clusters are non-contiguous (fragmented), read/write operations require multiple seeks, degrading performance. In these cases it’s useful to perform defragmentation to rearrange sectors into sequential blocks.

Hard Disk Drives

Physical Structure:

HDDs contain rotating magnetic platters coated with ferromagnetic material. Data is stored as magnetic patterns organized into:

The platters are mounted on a spindle and spin at high speeds (RPM). An actuator arm with a read/write head moves across the platters to access data.

The entire assembly is enclosed in a sealed case to protect against dust, scratches, and environmental contaminants, while also providing shock resistance.

Access Time Components

During the read/write process, several time components contribute to the total access time:

The total access time is the sum of these components: T_\text{Access} = T_\text{Seek} + T_\text{Rotation} + T_\text{Transfer} + T_\text{Controller}

The Data locality, the tendency for related data to be stored close together, can significantly reduce access time by minimizing seek and rotation delays. The locality factor is represented by \alpha. The adjusted access time considering locality is:

T_\text{Access} = (1-\alpha)(T_\text{Seek} + T_\text{Rotation}) + T_\text{Transfer} + T_\text{Controller}

To reduce access time the HDDs include buffer memory that exploits spatial locality by storing neighboring sectors.

Writes target cache first, then flush to platters. This reduces repeated disk access for frequently accessed data.

Scheduling

When multiple I/O requests are fired, the disk scheduler determines the order of processing. The goal is to minimize total access time and maximize throughput. This introduces a Scheduling Delay as the disk may need to wait for the current request to finish before processing the next one. Common scheduling algorithms include:

Solid State Drives

SSDs use flash memory (no mechanical parts), managed by a silicon controller and uses the same form factors as HDDs.

At the beginning of its life, an SSD is faster than an HDD because it has no seek time or rotation delay. However, as the SSD fills up and undergoes more write cycles, its performance can degrade, mainly for writes.

Data is organized into:

Each cell has a limited number of write cycles, leading to wear-out as the oxide layer degrades.

Each page can be in one of three states:

Writes always target empty pages; updates write to new pages and mark old pages dirty. Blocks containing only dirty pages can be erased. This prevents repeated wear on single cells but introduces challenges:

To mitigate wear-out, SSDs implement a technique called Wear Leveling that distributes write/erase cycles evenly across cells. Periodically relocates data to ensure \text{max cycles} - \text{min cycles} < e (small threshold).

SSDs uses Flash Translation Layer (FTL), a firmware component that manages the mapping between logical block addresses (LBAs) used by the operating system and the physical addresses of the flash memory. Mapping strategies include:

Reliability

Unrecoverable Bit Error Ratio (UBER) differs from HDDs: HDD UBER increases linearly with age while SSD UBER change over time, starting low, increasing as the drive wears out, and then rising sharply near the end of its lifespan.

Storage Architectures
Direct Attached Storage (DAS)

Direct Attached Storage is physically connected to a single server (internal or external via SATA, USB).

Network Attached Storage (NAS)

Network Attached Storage is storage that is connected to a network and has its own IP address, appearing as a file server. It provides file-level access to data over the network, allowing multiple clients to access and share files simultaneously.

Storage Area Network (SAN)

Storage Area Network is a network that provides block-level access to data, allowing servers to access storage as if it were directly attached.

Networking

The internal network connecting servers and storage within a data center is called Interconnect, or data center network (DCN). When scaling a data center the network might become the bottleneck of the system as all the servers need to communicate with each other.

The bandwidth capacity between the two halves of the network is indicated by the bisection bandwidth, which is the minimum bandwidth that must be available between two halves of the network to support full communication between them:

l \times L \geq C \times \frac{S}{2}

where:

The factor S/2 represents traffic flowing between halves. Traffic distribution must be balanced across links to prevent congestion.

The traffic within a DCN can be categorized into two types:

The network topology can be divided into three categories:

Switch-centric

Switch-centric architecture relies on switches to handle routing and forwarding. Switches manage the majority of the traffic, while servers focus on computation.

Three-Tier Architecture

The traditional approach uses a hierarchical topology with three layers:

graph TD
    subgraph Core Layer
        C1[Core Switch 1]
        C2[Core Switch 2]
    end

    subgraph Aggregation Layer
        A1[Aggregation Switch 1]
        A2[Aggregation Switch 2]
    end

    subgraph Access Layer
        S1[ToR Switch 1]
        S2[ToR Switch 2]
        S3[ToR Switch 3]
        S4[ToR Switch 4]
    end

    C1 <--> C2
    C1 --> A1
    C1 --> A2
    C2 --> A1
    C2 --> A2

    A1 --> S1
    A1 --> S2
    A2 --> S3
    A2 --> S4

To ensure high availability and fault tolerance, all the layers should have redundant connections, between upper layers and between the same layer.

Based on the bandwidth ratios between layers, the network can be classified as:

Leaf-Spine Architecture

A modern flat, two-layer topology:

graph TD
    subgraph Spine Layer
        S1[Spine Switch 1]
        S2[Spine Switch 2]
    end

    subgraph Leaf Layer
        L1[Leaf Switch 1]
        L2[Leaf Switch 2]
        L3[Leaf Switch 3]
        L4[Leaf Switch 4]
    end

    S1 --> L1
    S1 --> L2
    S1 --> L3
    S1 --> L4

    S2 --> L1
    S2 --> L2
    S2 --> L3
    S2 --> L4

If each leaf switch connects to every spine switch, the topology is called Clos Topology (Multi-stage Clos). This fully-interconnected leaf-spine variant enables any server to reach any other via a single hop. With k spine switches and n leaf switches:

This architecture uses homogeneous switches that are cheaper and easier to manage than hierarchical tiers. It also allows for scalability by adding leaf switches without architectural changes.

In the Clos topology, the connection is uni-directional, in case of a bi-directional connection is called Folded Clos.

It is possible to extend this architecture by grouping spine-leaf switches into pods and add a higher-level spine layer to connect the pods together, creating a Fat Tree topology.

Server-centric

A Server-centric architecture is a network design where servers themselves perform routing using their network interfaces (NICs) for switching operations. Servers handle both computation and networking tasks.

This increases the amount of hops between servers but allows more flexible and scalable network design, since servers can be added or removed without affecting overall network topology.

Hybrid

A Hybrid architecture combines elements of both switch-centric and server-centric designs. Some routing is performed by dedicated switches (especially for core/aggregation layers), while servers perform low-level routing. This approach enables trade-offs between network simplicity and flexibility.

Scale-Up Network (HPC Networks)

Designed for high-performance computing (HPC) workloads such as large-scale ML training. Resources are organized into Pods (groups of GPUs/accelerators) and the two type of communication is performed:

Building

Data centers allocate approximately 40% of space to IT equipment, the remainder supports cooling, power distribution, and failure recovery.

Cooling Systems

IT equipment generates significant heat, which must be dissipated to maintain optimal operating temperatures and prevent hardware damage. There are two main cooling approaches:

Data centers uses cold aisle/warm aisle configuration to maximize the air cooling efficiency:

The closed-loop approach includes:

Cooler system can include liquid cooling for high-power devices (GPUs, TPUs), offering superior thermal efficiency.

Power Usage

Data centers consume massive amounts of electricity, and only ~45% powers IT equipment. The remainder supports cooling, power conversion, and distribution overhead.

Power Usage Effectiveness (PUE): Metric measuring overall facility efficiency, lower is better: \text{PUE} = \frac{\text{Total Facility Power}}{\text{IT Equipment Power}} \geq 1

Data Center Infrastructure Efficiency (DCIE): The inverse metric: \text{DCIE} = \frac{\text{IT Equipment Power}}{\text{Total Facility Power}} \leq 1

In case of power failure, data centers must ensure continuous operation and graceful shutdown to prevent data loss and hardware damage. This is achieved through:

Data Center Availability

The availability of a data center is divided into four tiers:

RAID

RAID (Redundant Array of Independent Disks) is a data storage technology that combines multiple physical disk drives into a single logical unit to improve performance, increase storage capacity, and enhance reliability through redundancy.

This is done in a transparent way to the user, who sees a single logical disk, while the data is distributed across multiple physical disks.

This is done by striping data across multiple disks. As there are multiple disks, the read/write operations can be performed in parallel, improving performance.

Striping is performed by dividing data into strip units (dimension of the block) that are written into the stripe width (number of disks) in a round-robin fashion.

Bigger chunks reduces the seek time, but smaller chunks allows for more parallelism.

Increasing the number of disks increases the probability of failure, needs to use redundancy to ensure data integrity and availability.

It is possible to perform data reconstruction using the redundant information, but this process is time-consuming. Thanks to the parity information, it is possible to reconstruct the data of a failed disk by using the data from the remaining disks and the parity information.

RAID usually include a hot spare disk, which is a standby disk that can automatically replace a failed disk without manual intervention. When a disk fails, the RAID controller detects the failure and initiates the reconstruction process using the hot spare, minimizing downtime and ensuring data availability.

There are multiple RAID levels, each with different trade-offs between performance, redundancy, and storage efficiency:

RAID 0

In RAID 0, data is striped across multiple disks without any redundancy. This provides improved performance and capacity but no fault tolerance. If any disk fails, all data in the array is lost.

Disk 0 Disk 1 Disk 2 Disk 3
Block 0 Block 1 Block 2 Block 3
Block 4 Block 5 Block 6 Block 7
Block 8 Block 9 Block 10 Block 11

RAID 1

RAID 1 uses mirroring, where data is duplicated across two or more disks. This provides high reliability, as the system can continue to operate even if one disk fails. However, it has a storage efficiency of at most 50% since each piece of data is stored on multiple disks.

The read performance can be improved as the system can read from multiple disks in parallel.

No error correction is computed, so the write performance is similar to a single disk.

Disk 0 Disk 1
Block 0 Block 0
Block 1 Block 1
Block 2 Block 2

RAID 01

In RAID 0+1, data is first mirrored across multiple disks (RAID 1) and then stripped (RAID 0).

Disk 0 Disk 1 Disk 2 Disk 3
Block 0 Block 1 Block 0 Block 1
Block 2 Block 3 Block 2 Block 3
Block 4 Block 5 Block 4 Block 5

RAID 10

In RAID 1+0, data is first stripped (RAID 0) and then mirrored (RAID 0).

Disk 0 Disk 1 Disk 2 Disk 3
Block 0 Block 0 Block 1 Block 1
Block 2 Block 2 Block 3 Block 3
Block 4 Block 4 Block 5 Block 5

RAID 1+0 can tolerate multiple disk failures as long as they are not in the same mirrored pair, as losing a disk in RAID 0+1 would result in the loss of the entire array.

RAID 4

RAID 4 uses a dedicated parity disk to store parity information for error correction. Data is striped across multiple disks, and in the parity disk is stored the parity information computed as the XOR of the data blocks. This allows for data recovery in case of a single disk failure, but the dedicated parity disk can become a bottleneck for write operations.

Each update requires updating the parity information, which can lead to performance degradation, especially for write-intensive workloads.

Disk 0 Disk 1 Disk 2 Parity Disk
Block 0 Block 1 Block 2 P0 = B0 ⊕ B1 ⊕ B2
Block 3 Block 4 Block 5 P1 = B3 ⊕ B4 ⊕ B5
Block 6 Block 7 Block 8 P2 = B6 ⊕ B7 ⊕ B8

The update can be performed in two ways:

RAID 5

RAID 5 is similar to RAID 4 but distributes the parity information across all disks instead of using a dedicated parity disk. This improves write performance by eliminating the bottleneck of a single parity disk, while still providing fault tolerance for a single disk failure.

Disk 0 Disk 1 Disk 2 Disk 3
Block 0 Block 1 Block 2 P0 = B0 ⊕ B1 ⊕ B2
Block 3 Block 4 P1 = B3 ⊕ B4 ⊕ B5 Block 5
Block 6 P2 = B6 ⊕ B7 ⊕ B8 Block 8 Block 9
P3 = B9 ⊕ B10 ⊕ B11 Block 10 Block 11 Block 12

RAID 6

RAID 6 extends RAID 5 by adding an additional parity block, allowing for fault tolerance against two simultaneous disk failures. This is achieved by using two different parity calculations, such as XOR and Reed-Solomon codes, to provide redundancy.

This increases the storage overhead and can further degrade write performance due to the additional parity calculations.

Disk 0 Disk 1 Disk 2 Disk 3 Disk 4
Block 0 Block 1 Block 2 P0 = B0 ⊕ B1 ⊕ B2 Q0 = RS(B0, B1, B2)
Block 3 Block 4 P1 = B3 ⊕ B4 ⊕ B5 Q1 = RS(B3, B4, B5) Block 5
Block 6 P2 = B6 ⊕ B7 ⊕ B8 Q2 = RS(B6, B7, B8) Block 8 Block 9
P3 = B9 ⊕ B10 ⊕ B11 Q3 = RS(B9, B10, B11) Block 11 Block 12 Block 13

Virtualization

In virtualization, the physical resources of a computer are abstracted and shared among multiple virtual machines (VMs) that run on top of a hypervisor or virtual machine monitor (VMM). Each VM operates as if it has its own dedicated virtual hardware, allowing for isolation and flexibility in resource allocation, providing the following characteristics:

A program works thanks to different levels of abstraction, each providing a different view of the underlying hardware.

Virtualization can be implemented at various levels of the software stack:

When the virtual machine’s ABI/ISA differs from the physical machine’s ISA, emulation is necessary to translate instructions from one architecture to another. This translation introduces performance overhead but enables running software compiled for different architectures.

The benefits of virtualization include:

Virtual Machine Managers

Hypervisor (or Virtual Machine Monitor - VMM) is the software layer that enables virtualization, handling tasks such as resource allocation, scheduling, hibernation, creation, and lifecycle management. It provides an abstraction layer between the physical hardware and the virtual machines. It can be classified into two types:

Virtualization Techniques

Paravirtualization

Paravirtualization reduces guest OS overhead by avoiding direct hardware interaction. Instead, the guest OS calls hypervisor functionalities via hypercalls. This approach requires modifying the guest OS to be aware of virtualization, but it improves performance by reducing the number of VM traps and hypervisor interventions compared to full virtualization.

graph TD
    App[Application] --> GuestOS[Guest OS]
    GuestOS -- Hypercall --> Hypervisor[Hypervisor]
    Hypervisor -- Operation --> Hardware[Physical Hardware]

Full Virtualization

Full virtualization allows unmodified guest operating systems to run on the hypervisor. The hypervisor uses a trapping mechanism:

  1. Guest OS executes an instruction intended for hardware.
  2. Hardware detects the instruction requires virtualization and traps to the hypervisor.
  3. Hypervisor intercepts the trap and emulates the instruction’s effect.
  4. Control returns to the guest OS.

This approach is transparent to the guest OS but introduces more hypervisor overhead than paravirtualization due to frequent trapping and emulation.

graph TD
    App[Application] --> GuestOS[Guest OS]
    GuestOS -- Private Instruction --> Hardware[Physical Hardware]
    Hardware -- Trap --> Hypervisor[Hypervisor]
    Hypervisor -- Emulate --> Hardware[Physical Hardware]

Containers

Containers are a lightweight form of virtualization that provides a pre-configured environment for applications, including code, dependencies, libraries, and configurations. Containers share the host OS kernel but run in isolated user spaces, allowing for efficient resource utilization and fast startup times.

The main advantages of containers include:

graph TD
    App[Application] --> Runtime[Container Runtime]
    Runtime --> HostOS[Host OS]
    HostOS --> Hardware[Physical Hardware]
    Runtime --> Hardware

Cloud Computing

Cloud computing provides on-demand access to a large-scale, publicly available collection of computing, storage, and network resources. Users access resources using a pay-as-you-go model, paying only for what they use. This eliminates the need to invest in and maintain physical infrastructure, enabling flexible, scalable IT operations.

This is achieved through a virtualization infrastructure that abstracts physical resources and delivers them as services over the internet, allowing managing, moving, and scaling resources without worrying about underlying hardware.

Service Models (X-as-a-Service)

Cloud services follow the “X-as-a-Service” model where customers focus on their business needs, not infrastructure management or code deployment. Services are abstracted and delivered through standard interfaces.

Cloud Deployment Models

Cloud services are deployed using different models depending on organizational requirements:

Dependability

Systems fail due to: defects, degradation, radiation, design errors, bugs, attacks, and human errors. This leads to economic losses, information loss, physical harm, and reputation damage.

Dependability is a measure of trust toward a system. It comprises five key attributes:

Fault-Error-Failure Chain

A fault is a defect or anomaly in a system.

When a fault is activated, it becomes an error, a deviation from correct operation.

If an error is not detected and corrected, it propagates and ultimately causes a failure, meaning that the system ceases to perform its intended function.

Dependability Approaches

Two primary techniques address dependability:

This is a tradeoff between cost (hardware, performance, and development), performance, and dependability. Design decisions depend on: technologies, requirements, context, and environment.

Reliability Metrics

Reliability follows an exponential failure model: R(t) = e^{-\lambda t}

where:

Mean Time To Failure (MTTF): expected time until first failure: \text{MTTF} = \int_0^\infty R(t) \, dt = \frac{1}{\lambda}

Mean Time To Repair (MTTR): expected time to detect, repair, and recover: \text{MTTR} = t_{\text{detect}} + t_{\text{repair}} + t_{\text{recover}}

Mean Time Between Failures (MTBF): expected time between consecutive failures in repairable systems: \text{MTBF} = \text{MTTF} + \text{MTTR}

Availability formula: A = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} = \frac{\text{uptime}}{\text{uptime} + \text{downtime}}

Failures In Time (FIT): number of failures per billion device-hours: \text{FIT} = \frac{10^9}{\text{MTBF}}

Component Lifecycle

A component experiences three phases during its operational lifetime:

System updates and new deployments risk introducing failures into production. Some common strategies mitigate this risk:

Reliability Block Diagram

The system structure is represented as a block diagram where each component is a block and links show dependencies.

A system functions if there exists at least one operational path from start to end.

Connections represent two reliability configurations:

Standby Redundancy: A redundant component remains idle until the primary component fails, then automatically activates. This approach approximately doubles the MTTF compared to a single component.

r-out-of-n Redundancy

A system that requires r out of n components to function correctly for the system to operate.

The system reliability for r-out-of-n redundancy (assuming identical components, each with reliability R):

R_{\text{voting}} = \sum_{i=r}^{n} \binom{n}{i} R^i(1-R)^{n-i}

This formula sums the probability that at least r components are operational.

When the majority of components must be operational, the reliability of a single component could be higher than the reliability of the entire system, especially when the failure rate \lambda is high.

Performance Evaluation

Performance of a computer system measures its overall effectiveness, including metrics such as throughput (tasks completed per unit time), latency (response time), and resource utilization.

The evaluation can be performed using three main approaches:

  1. Intuition Analysis: Informal assessment based on experience. Limited accuracy but fast.
  2. Experimental Evaluation: Empirical measurement through actual system tests, prototypes, and benchmarks. Highly accurate but expensive and time-consuming.
  3. Model-Based Approach: Create a simplified mathematical or simulation model of the system that captures key characteristics. Model predictions are cheaper and faster than real measurements while maintaining reasonable accuracy. Two main techniques:
    • Analytical/Numerical Methods: Use probability theory and stochastic process theory to derive mathematical solutions.
    • Simulation: Reproduce system behavior by executing traces of the model.

Queueing Networks

A Queueing Network Model is a commonly used performance model representing a system as a collection of interconnected service stations. Customers (requests, transactions, jobs) arrive at stations, wait in queues, receive service (server, resource), and may depart or route to other stations.

There are three main types of queueing networks based on customer flow:

When customers arrive at a station, they may have to wait if the server is busy. The order in which customers are served is determined by the queue discipline (scheduling policy):

Queue size can be finite (limited waiting space) or infinite (unlimited waiting space). Finite queues can lead to blocking and lost customers, while infinite queues can lead to unbounded wait times.

Customers move between service stations according to routing decisions:

Operational Laws

Operational Laws are fundamental equations that relate measurable system variables to derive performance metrics. They are based on observable variables and basic principles such as flow conservation and work conservation.

Flow Balance Principle: In equilibrium, A \approx C (arrivals equal completions). If A \neq C, the system is not in balance (building up queues or draining).

Residence Time Components:

From which we can derive:

Interactive Response Time:

Which leads to the Interactive Response Time formula:

R = \frac{N}{X} - Z

Visit Ratios and Resource Utilization:

Performance Bounds

The bounds of a system are defined by the bottleneck resource, which is the resource with the highest demand D_\text{max}.

The bounds can be expressed under two conditions:

Open System Bounds

In open system, the saturation point is reached when the arrival rate \lambda equals the throughput X:

The throughput bounds: 0 \leq X \leq \frac{U = 1}{D_\text{max}}

The response time cannot be upper bounded as it can grow indefinitely as the arrival rate approaches the saturation point.

The response time bounds: R \geq D

Closed System Bounds

In closed system, the saturation point is reached when the number of customers N equals the number of customers required to saturate the bottleneck resource:

N^* = \frac{D + Z}{D_\text{max}}

The throughput bounds: \underbrace{\frac{N}{ND + Z}}_\text{Pessimistic} \leq X \leq \underbrace{\min\left(\overbrace{\frac{1}{D_\text{max}}}^{Heavy}, \overbrace{\frac{N}{N - Z}}^\text{Light}\right)}_\text{Optimistic}

The response time bounds: \underbrace{\max(\overbrace{D}^\text{Light}, \overbrace{ND_\text{max} - Z})_\text{Heavy}}_\text{Optimistic} \leq R \leq \underbrace{ND}_\text{Pessimistic}

Ultima modifica:
Scritto da: Andrea Lunghi