Redundancy and Fault Tolerance
The SmartMEDIA architecture makes it possible to build an IPTV/OTT broadcasting system without a single point of failure and with guaranteed high availability of content and services. Depending on business objectives, budget and other factors, the system administrator can design an optimal redundant architecture.
SmartMEDIA allows you to build a redundant system in two main ways:
- Building fault-tolerant clusters achieves high service availability by providing redundancy for individual components (storage, Conveyor services, playlist generators, HTTP servers). Failure of any component does not bring the cluster down and remains invisible to the end user. This approach is usually applied at the scale of a single site (data center). Typical solutions are described below to help you understand the SmartMEDIA architecture better and choose the most suitable clustering scheme.
- Load balancing and redundancy by redirecting the client to the nearest working server/cluster. Unfortunately, even the most reliable clusters cannot provide 100% availability of services: the “human factor” remains (for example, an administrator may delete data from the cluster storage), as does the probability that one of the fault-tolerant components fails (for example, an index database cluster failure) or that a whole data center fails (power outage, communication failure, etc.). Service availability can be improved further by switching over between individual servers, clusters or entire data centers.
Building Failover Clusters
Redundancy of VoD Services
Redundancy of the VoD broadcasting service can be implemented by providing fault tolerance for file storage and for data delivery over HTTP. Because this task is straightforward, it is not described in detail in this document.
The following things should be considered when building the system:
- Failures are possible during the content preparation stage. For example, a failure can occur during transcoding or segmentation, and the resulting file will be written only partially.
- Therefore, the content must be verified after each preparation step (transcoding, segmentation by the vodconvert utility) before it is written to the storage(s). The duration of each track should be checked, as well as the list of tracks; a minimal verification sketch is given after this list.
- Otherwise, the client may receive incomplete content or content that is not suitable for playback.
- Content should be verified after writing to each of the storages.
- Using several separate storages is more reliable than using one “cluster” storage, because the probability of a complete cluster failure remains, as does the probability of data corruption in the clustered storage due to external factors.
- Using multiple storages requires more careful verification of the data on each storage. In some cases, such a solution is also more complex in terms of load balancing between the storages.
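A minimal verification sketch is shown below. It assumes the ffprobe utility is available and that the expected track list and duration are known from the source asset; the file path, expected values and tolerance are illustrative assumptions, not SmartMEDIA parameters.

# Minimal post-processing check: verify the track list and total duration with ffprobe.
# The expected values and tolerance below are assumptions for illustration only.
import json
import subprocess

def probe(path):
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True).stdout
    return json.loads(out)

def verify_asset(path, expected_tracks, expected_duration_sec, tolerance_sec=2.0):
    info = probe(path)
    tracks = sorted(s["codec_type"] for s in info["streams"])
    if tracks != sorted(expected_tracks):
        return False, "track list mismatch: %s" % tracks
    duration = float(info["format"]["duration"])
    if abs(duration - expected_duration_sec) > tolerance_sec:
        return False, "duration %.1f s instead of %.1f s" % (duration, expected_duration_sec)
    return True, "ok"

# Example: check a transcoded file before copying it to the storage(s).
ok, reason = verify_asset("/mnt/vod/movie_0001.mp4", ["video", "audio"], 5400.0)
if not ok:
    raise SystemExit("verification failed: " + reason)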
Redundancy of LiveTV-based Services (Live, nDVR, TSTV, etc.)
The nDVR service provisioning system has a more complex architecture and, as a result, more complex redundancy schemes. Examples of possible solutions, with their benefits and drawbacks, are discussed below.
Solution 1: Record and Storage Servers Duplication
The solution is to record each channel to several independent servers. Each of them runs all the services and components necessary for the Live/nDVR services to function. The data is stored in local POSIX storage.
Because data recording on different servers is not synchronized, the content can be segmented at different moments, the media chunks will differ, and the same chunk cannot be obtained from different servers; simple load balancing therefore cannot be used in this case. Redundancy is provided by redirecting the initial playlist/manifest requests to a working server using HTTP Redirector.
The disadvantage of this solution is that if any service or server fails, content delivery to the subscriber is interrupted until the content is re-requested from SmartMEDIA Redirector. Having detected the failure, SmartMEDIA Redirector redirects the client to a working server.
Solution 2: Using S3 Object Storage
Recording services redundancy — 2N (Active/Standby); other services redundancy — N+M (Active/Active)
The solution is to provide redundancy of the recording servers in Active/Standby mode. Each pair of servers handles its own group of TV channels.
nDVR content is recorded to and stored on an external S3-compatible service (local or cloud-based, for example Amazon S3).
In this architecture, all the recording and streaming servers in the cluster share the same copy of the index database. Its consistency across all nodes is maintained by MongoDB's internal replication mechanisms.
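As a hedged illustration of this point, the sketch below reads the index through a MongoDB replica set connection string, so the failure of a single database node stays transparent to the service; the host names, replica set name, database and collection names are assumptions, not SmartMEDIA defaults.

# Illustrative only: querying an nDVR index kept in a MongoDB replica set.
# Host names, replica set name, database and collection names are hypothetical.
from pymongo import MongoClient

# One connection string lists all replica set members; the driver follows failovers.
client = MongoClient(
    "mongodb://idx1:27017,idx2:27017,idx3:27017/"
    "?replicaSet=rs0&readPreference=secondaryPreferred",
    serverSelectionTimeoutMS=5000)

chunks = client["ndvr"]["chunks"]
# Every node holds the same index, so any streaming server can run this query.
latest = chunks.find_one({"channel": "channel1"}, sort=[("start_time", -1)])
print(latest)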
Because all the streaming servers of the same cluster (“Video Server” in the diagram) have access to the same copy of the content (via the S3 protocol) and to the same index database, load balancing between them can be implemented either by the SmartMEDIA HTTP Redirector service or by standard HTTP load-balancing tools (e.g. virtual IP addresses + DNS round-robin, LVS, a hardware HTTP balancer and others).
Each content delivery server (Video Server) is both a playlist generation server and an HTTP access proxy to the S3 storage: the HTTP reverse proxy requests a playlist from the Playlist Generator component and returns it to the device; after that it starts sending the nDVR content from the S3 storage to the subscriber.
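A minimal sketch of these two roles is given below. It is not the SmartMEDIA implementation: the Playlist Generator address, bucket name and object layout are assumptions used only to illustrate how a delivery server can combine playlist generation with proxying of S3 objects.

# Illustrative sketch of a delivery server's two roles; URLs, bucket and key layout are assumed.
import boto3
import requests

# Any S3-compatible endpoint can be used (Amazon S3, Ceph RADOS Gateway, etc.).
s3 = boto3.client("s3", endpoint_url="http://s3.local:7480",
                  aws_access_key_id="KEY", aws_secret_access_key="SECRET")

def get_playlist(channel, device_params):
    # Role 1: forward the playlist request to the local Playlist Generator component.
    r = requests.get(f"http://127.0.0.1:8081/{channel}/index.m3u8",
                     params=device_params, timeout=5)
    r.raise_for_status()
    return r.text

def get_chunk(channel, chunk_name):
    # Role 2: fetch the requested nDVR chunk from the S3 storage and return it to the client.
    obj = s3.get_object(Bucket="ndvr", Key=f"{channel}/{chunk_name}")
    return obj["Body"].read()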
Solution 3: Using an S3 Object Storage and Redundant SmartMEDIA Services (All-in-One)
Recording services redundancy — 2N (Active/Standby); other services redundancy — N+M (Active/Active)
Solution 3 differs from the previous one in that the recording services (primary and backup) are moved from dedicated servers to the delivery servers. This solution reduces hardware costs, but increases the load on the media servers and raises the requirements for the stability of data transmission over the network (losses, jitter, etc.).
Solution 4: Using a Ceph (S3) Object Storage, Combined With the Record Service
Recording services redundancy — 2N (Active/Standby); other services redundancy — N+M (Active/Active)
Ceph is an example of a relatively inexpensive and reliable storage solution with S3 access. Other solutions, such as SwiftStack, can also be used as storage.
The Ceph cluster (not included in the SmartMEDIA product) can consist of any number of storage servers (“MSTORx” in the diagram). The local disks of the servers are joined into a single cluster, which provides a sufficiently high level of reliability for nDVR content storage. The load between the storage servers is balanced by Ceph's internal mechanisms.
The SmartMEDIA Conveyor component (primary or backup) can share hardware with the Ceph storage. If the primary service fails, recording of the live streams switches to the standby one. Some servers in the cluster may perform only the nDVR content storage function.
As in Solution 2, the load of playlist generation and nDVR content broadcasting is distributed among several video servers running in parallel.
The number of SmartMEDIA Conveyor pairs can vary, as can the number of Ceph storage servers.
Load Balancing
Uniform and flexible load balancing between multiple video servers or their groups can be provided by the SmartMEDIA Redirector component, which:
- allows you to define server groups and load balancing rules within the group and between groups, depending on the location of the client and the requested content;
- redirects subscriber requests to working servers if one or more source servers fail;
- monitors the servers’ performance;
- checks the content availability.
The IP addresses of subscriber devices are grouped into several subnets. Each subnet can be assigned to and serviced by a group of video servers. The allocation of subnets and the procedure for serving them are maintained by the system administrator, taking into account the network organization, the number of subscribers, etc. To put new video servers under SmartMEDIA Redirector control, you must specify them in the Redirector's configuration.
The component has two services:
- for protocols based on HTTP (HLS, DASH, SmoothStreaming) — httpRedirector;
- for the RTSP protocol — rtspRedirector.
A simplified diagram of the HTTP Redirector workflow is shown below:
SmartMEDIA Redirector accepts content viewing requests from subscriber devices, identifies video servers with the best content access conditions and redirects subscribers to these video servers.
When a request is received, Redirector determines the group of video servers (a so-called “farm”) that processes requests from devices on the client's subnet. The video servers of each group process incoming requests in accordance with the specified load balancing method (see below). In general, Redirector selects the next video server from the group and checks the availability of the requested content in its storage (a conceptual sketch of this fallback loop follows the list):
- if the content exists, Redirector forwards the request to the selected video server;
- otherwise, Redirector switches to the next video server in the group;
- if none of the servers in the group can process the request, Redirector tries to determine the next most suitable group;
- if no group can process the request, an error is returned to the subscriber device. Information about the cause of the error is recorded in the log file.
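The sketch below is a conceptual illustration of that fallback loop, not the Redirector's actual code; the availability probe (a HEAD request to the video server) and the farm definitions are assumptions.

# Conceptual sketch of the fallback loop; the availability probe and farm contents are assumed.
import logging
import requests

log = logging.getLogger("redirector")

FARMS = {  # hypothetical farm definitions, mirroring the configuration example further below
    "site6": ["http://10.72.10.18:80", "http://10.72.10.110:80"],
    "default": ["http://10.72.5.4:80", "http://10.72.5.5:80", "http://10.72.5.8:80"],
}

def content_available(server_url, path):
    try:
        return requests.head(server_url + path, timeout=2).status_code == 200
    except requests.RequestException:
        return False

def choose_server(path, farms_for_subnet):
    """Return the URL to redirect the client to, or None if no server can serve the path."""
    for farm_name in farms_for_subnet:
        for server_url in FARMS.get(farm_name, []):
            if content_available(server_url, path):
                return server_url + path
        log.warning("no server in farm %s can serve %s, trying the next farm", farm_name, path)
    return None  # the caller answers the device with an error and logs the cause

# Example: a device from a subnet assigned to the "site6" farm requests an HLS playlist.
print(choose_server("/channel1/index.m3u8", ["site6", "default"]))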
The server that processes the request within the group is selected according to one of the following policies (a sketch of the second policy follows this list):
- First server to respond. If a server is underloaded, it responds faster than the others, so the probability of its selection increases.
- By content ID. In this case, the content (TV channel, VoD asset, etc.) is “anchored” to a particular serving server. This increases the chance of cache hits on that server, but may result in uneven load distribution between servers.
In addition to the balancing policy, a “weight” parameter can be assigned to each server in the group. When all other conditions are equal, the request is redirected to the server with the higher weight.
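The sketch below illustrates one possible way to combine the “by content ID” policy with server weights; the hashing scheme is an assumption for illustration and not necessarily the algorithm Redirector uses.

# Illustrative "by content ID" selection with weights (not the product's actual algorithm).
import hashlib

def pick_by_content_id(content_id, servers):
    """servers: list of (url, weight) pairs. The same content ID always maps to the same
    server, and servers with a higher weight receive proportionally more content IDs."""
    expanded = [url for url, weight in servers for _ in range(weight)]
    digest = hashlib.md5(content_id.encode()).hexdigest()
    return expanded[int(digest, 16) % len(expanded)]

servers = [("http://10.72.10.18:80", 2), ("http://10.72.10.110:80", 1)]
print(pick_by_content_id("channel1", servers))  # always the same server for "channel1"
print(pick_by_content_id("channel7", servers))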
SmartMEDIA Redirector monitors the status of the video servers by periodically sending requests to them, so that requests are not redirected to a failed video server. Information about the unavailability of a server or of content is cached for a configured time, after which the server is re-checked. When a video server becomes operational again, HTTP Redirector resumes distributing requests to it, taking its changed state into account.
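A minimal sketch of this caching behaviour is shown below, assuming that a failure is simply remembered for a fixed number of seconds (the server_status_cache_time_sec parameter in the configuration example below plays this role); the probe itself is a hypothetical HEAD request.

# Sketch of caching server unavailability for a fixed period before re-checking.
import time
import requests

STATUS_CACHE_TIME_SEC = 30.0   # compare server_status_cache_time_sec in the configuration below
_failed_until = {}             # server URL -> timestamp until which it is considered down

def server_alive(server_url):
    now = time.time()
    if _failed_until.get(server_url, 0) > now:
        return False           # cached failure: do not probe this server again yet
    try:
        if requests.head(server_url + "/", timeout=2).status_code < 500:
            return True        # the server answers again: resume sending requests to it
    except requests.RequestException:
        pass
    _failed_until[server_url] = now + STATUS_CACHE_TIME_SEC
    return False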
Redundancy and Load Balancing with HTTP Redirector
The HTTP Redirector service is designed to provide redundancy of streaming servers, balance the load between them and optimize traffic flows from the platform to subscribers. It allows you to configure server groups (farms) and the weight of each server within a group. You can also assign a separate balancing policy to each group.
The farms array within the redirection section defines the farms, with their names (identifiers) and server lists.
The locations array defines lists of subnets and, for each of them, a list of farms in the order in which content and server availability will be polled.
An example of a configuration that distributes requests between sites is given below:
{
  "core": {
    "port": 9201,
    "interface": "0.0.0.0"
  },
  "redirection": {
    "vod_range_cache_time_sec": 20,
    "request_timeout_sec": 5.0,
    "channel_regex": "([^/]+)/.*",
    "server_status_cache_time_sec": 30.0,
    "uri_path": "/",
    "live_cache_time_sec": 10,
    "clear_cache_interval_sec": 60,
    "channel_regex_index": 1,
    "range_cache_time_fraction": 0.05,
    "farms": [
      {
        "name": "site6",
        "servers": [
          { "url": "http://10.72.10.18:80/" },
          { "url": "http://10.72.10.110:80/" }
        ]
      },
      {
        "name": "site1",
        "servers": [
          { "url": "http://10.72.10.50:80/" },
          { "url": "http://10.72.10.106:80/" }
        ]
      },
      {
        "name": "default",
        "servers": [
          { "url": "http://10.72.5.4:80/" },
          { "url": "http://10.72.5.5:80/" },
          { "url": "http://10.72.5.8:80/" }
        ]
      }
    ],
    "locations": [
      {
        "farms": [ "site6", "default" ],
        "uri_regex": ".*",
        "ip_mask": [
          "172.21.128.0/17",
          "172.22.0.0/17",
          "172.27.0.0/20"
        ]
      },
      {
        "farms": [ "site1", "default" ],
        "uri_regex": ".*",
        "ip_mask": [
          "172.24.112.0/20",
          "172.24.192.0/19",
          "172.25.0.0/18"
        ]
      },
      {
        "farms": [ "default" ],
        "uri_regex": ".*",
        "ip_mask": [
          "10.65.1.97/32",
          "172.16.0.0/12",
          "0.0.0.0/0"
        ]
      }
    ]
  }
}
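With such a configuration, a device request to Redirector is answered with an HTTP redirect to a video server from the farm that matches the client's subnet. A quick check from a client host might look like the sketch below (the Redirector host name and channel path are hypothetical):

# Hypothetical check of the redirect issued by httpRedirector on port 9201 (see "core" above).
import requests

r = requests.get("http://redirector.example:9201/channel1/index.m3u8", allow_redirects=False)
print(r.status_code)               # expected: a 3xx redirect
print(r.headers.get("Location"))   # a video server from the farm matching the client's subnet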