Module name: Distributed Storage (DFS – Distributed file systems, Database)
The previous module starts here: https://ulife.ai/stories/everything-you-need-to-know-about-threading-with-python
Big data is not just about computation. If we want to process a huge amount of data, we need a way to store it so that it remains easily accessible. For that, we have several solutions: file systems and databases spread across many computers. In this section, we will talk about distributed file systems (HDFS, GFS, …), database partitioning, and why you may need a NoSQL database for storing big data.
More concretely, we will discuss how to store petabytes of data across multiple computers while keeping them always available and consistent. We will introduce the Hadoop Distributed File System, the Google File System (the storage layer behind services like Google Drive), …
At the beginning of this book we introduced the concept of big data and how large the data we may want to store can be. Consider, for example, petabytes of data. Can we keep all of that on a single machine? Of course not. So we need a solution to store files on many computers.
This is where the DFS (Distributed File System) comes in. It helps store extremely large amounts of data on many computers in a way that keeps the data accessible at any time and without inconsistency, and it does so transparently: users do not have to care about what is going on in the background, they just access files as they would on their local computer.
This design creates many challenges that we will address in the upcoming sections, for example: keeping replicas consistent, tolerating server failures, and scaling to more machines while hiding all of this from the user.
In a DFS, files are split and saved across many servers, and managed in a way that solves most of the challenges stated above. For that, we need a specific configuration, and we must define some concepts first: the master node, which keeps the metadata (which chunks make up a file and where each chunk lives), and the chunk servers, which store the actual pieces of data. The master node is also responsible for security and access control of the different files.
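To make the master's role concrete, here is a minimal sketch in Python of the kind of metadata it has to keep: which chunks make up a file, which servers hold each chunk, and some access-control information. All names (MasterMetadata, chunk ids, server addresses) are hypothetical and for illustration only; real systems like GFS or HDFS keep much richer structures.

```python
# Minimal sketch of the metadata a master node might keep.
# All names and values here are made up for illustration.

class MasterMetadata:
    def __init__(self):
        self.file_to_chunks = {}   # file path -> ordered list of chunk ids
        self.chunk_locations = {}  # chunk id -> servers holding a replica
        self.acl = {}              # file path -> access-control info

    def register_file(self, path, chunk_ids, owner="user"):
        self.file_to_chunks[path] = list(chunk_ids)
        self.acl[path] = {"owner": owner, "mode": "rw-r--r--"}

    def record_replica(self, chunk_id, server):
        self.chunk_locations.setdefault(chunk_id, []).append(server)

    def locate(self, path):
        """Return, for each chunk of the file, the servers that hold it."""
        return [
            (cid, self.chunk_locations.get(cid, []))
            for cid in self.file_to_chunks.get(path, [])
        ]


meta = MasterMetadata()
meta.register_file("/home/user/some_file", ["c1", "c2"])
meta.record_replica("c1", "chunkserver-a:7000")
meta.record_replica("c1", "chunkserver-b:7000")
print(meta.locate("/home/user/some_file"))
```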
On your local machine, a file is saved somewhere on your local disk, and you access it simply by specifying its location (see the example tree below).
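For contrast, here is an illustrative (made-up) directory tree and the corresponding local access: the operating system resolves the path against a single tree on one disk, and that is all there is to it.

```python
# Illustrative only: a made-up local directory tree.
#
# /home
# └── user
#     ├── some_file
#     └── photos
#         └── trip.jpg
#
# Accessing a local file is just a matter of giving its path;
# the OS finds the blocks on the local disk for us.
with open("/home/user/some_file", "r") as f:
    content = f.read()
print(len(content))
```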
But with a DFS, when you ask for a file located at /home/user/some_file, for example, the process is much more complicated, and it depends entirely on the action you are trying to achieve. We first have to point out some architectural basics:
The main question in a DFS is how your file gets written across a distributed environment of many servers. Surprisingly, the process is simple: when you have a large file, you split it into parts called chunks, and each chunk is saved on a chunk server. The master server defined previously is supposed to know where each chunk is located, so we can easily access them later.
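A minimal sketch of this splitting step, in Python. The 64 MB chunk size, the server names and the round-robin placement are made-up values, only meant to illustrate the idea:

```python
# Sketch: split a large file into fixed-size chunks and assign each chunk
# to a chunk server. Chunk size and server list are illustrative values.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in GFS
CHUNK_SERVERS = ["chunkserver-a", "chunkserver-b", "chunkserver-c"]


def split_into_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield (chunk_index, bytes) pairs for the file at `path`."""
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield index, data
            index += 1


def assign_servers(chunk_index, replicas=3):
    """Naive round-robin placement of `replicas` copies of a chunk."""
    return [
        CHUNK_SERVERS[(chunk_index + i) % len(CHUNK_SERVERS)]
        for i in range(replicas)
    ]


# Example: plan the placement of a local file (no network involved here).
# for idx, data in split_into_chunks("big_file.bin"):
#     print(idx, len(data), assign_servers(idx))
```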
The complete process is described in the figure above. The client contacts the master node with information about the file chunk it is trying to write to the DFS. The master node responds with the set of chunk servers where the chunk can be saved, clearly specifying the primary replica.
The client can now contact the primary replica and send it the information about the file to save. The primary replica saves the chunk and, in the background, synchronises with the other servers so that a copy of the chunk is saved on them as well. An internal copy is therefore made between chunk servers; once the operation is completed, they contact the master node to notify it, and the master node saves that metadata. In the end, if everything went well and the chunk has been successfully saved on all the chunk servers, the client receives a "done" message from the primary replica. [See image]
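Below is a simplified sketch of that write path. Every class and method name is invented for illustration; a real GFS or HDFS client uses RPCs, leases and error handling that are omitted here, and the master is normally updated only after the replicas confirm.

```python
# Simplified write path: client -> master -> primary replica -> secondaries.
# All names are hypothetical; this only mirrors the steps described above.

class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.chunks = {}

    def store(self, chunk_id, data):
        self.chunks[chunk_id] = data


class PrimaryReplica(ChunkServer):
    def write(self, chunk_id, data, secondaries):
        # 1. Save the chunk locally.
        self.store(chunk_id, data)
        # 2. Synchronise the copy with the other chunk servers.
        for server in secondaries:
            server.store(chunk_id, data)
        # 3. Report success back to the client.
        return "done"


class Master:
    def __init__(self, servers):
        self.servers = servers
        self.locations = {}  # chunk_id -> names of servers holding it

    def allocate(self, chunk_id):
        """Pick a primary and two secondaries and record the placement.
        (Simplification: the first server is always the primary here.)"""
        primary, *secondaries = self.servers[:3]
        self.locations[chunk_id] = [s.name for s in self.servers[:3]]
        return primary, secondaries


# Client-side steps, following the description above:
servers = [PrimaryReplica("a"), ChunkServer("b"), ChunkServer("c")]
master = Master(servers)
primary, secondaries = master.allocate("chunk-0")        # ask the master
status = primary.write("chunk-0", b"...", secondaries)   # send data to primary
print(status)                                             # "done" on success
```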
NB: write operations are efficient when the chunk is appended at the end of a file, but challenging when we write in the middle of it.
The process for reading a file in a distributed environment is organised as follows. The client asks the master node for information about the path it is trying to access. The master answers back with the servers where the file's chunks are located, and the client can then read the content directly from them. The process is straightforward.
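The same two steps, sketched in Python. The master's answer and the chunk servers are represented as plain dictionaries, with all names and contents made up for illustration:

```python
# Simplified read path, mirroring the two steps described above.

# Step 1: the master's (hypothetical) answer for /home/user/some_file:
# each chunk id mapped to the servers that hold a replica of it.
master_answer = {
    "chunk-0": ["chunkserver-a", "chunkserver-b"],
    "chunk-1": ["chunkserver-b", "chunkserver-c"],
}

# Step 2: contents stored on each chunk server (normally fetched over the
# network; here just in-memory bytes).
chunk_servers = {
    "chunkserver-a": {"chunk-0": b"hello "},
    "chunkserver-b": {"chunk-0": b"hello ", "chunk-1": b"world"},
    "chunkserver-c": {"chunk-1": b"world"},
}


def read_file(master_answer, chunk_servers):
    """Reassemble the file by reading each chunk from any replica."""
    data = b""
    for chunk_id in sorted(master_answer):
        holders = master_answer[chunk_id]
        data += chunk_servers[holders[0]][chunk_id]
    return data


print(read_file(master_answer, chunk_servers))  # b'hello world'
```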
Source: Big data analytics, Uni-Hildesheim
Let's now look at the architecture of two DFS implementations: GFS (Google File System) and HDFS (Hadoop Distributed File System).
GFS
HDFS
| | HDFS | GFS |
| --- | --- | --- |
| Chunk (block) size | 128 MB | 64 MB |
| Default replicas | 3 | 3 |
| Master | NameNode | GFS Master |
| Chunk nodes | DataNode | Chunk Server |