"Big data" usually refers to data sets that are so large that they are difficult to collect, process, and analyze, as well as data that is stored in traditional infrastructures for long periods of time. The term "big" has several meanings; it describes the size of an organization, and more importantly, it defines the size of the IT infrastructure in an enterprise. The industry has high expectations for the use of big data, and the more business information that accumulates, the more value it has, but we need a way to tap into that value.
The popular image of big data may stem mainly from how cheap storage capacity has become, but the reality is that organizations create more and more data every day and struggle to mine valuable business intelligence from this vast amount of data. At the same time, users also keep data that has already been analyzed, because this old data can be compared with data collected in the future and is therefore still potentially valuable.
Why Big Data? Why now?
In addition to being able to store larger volumes of data than ever before, we are dealing with more types of data. The sources include online transactions, online social activity, automated sensors, mobile devices, and scientific instruments, to name a few. Beyond these fixed sources of data production, various transactional behaviors accelerate the rate of data accumulation; the explosion of social multimedia data, for example, stems from new online transaction and recording behaviors. Data is always growing, but the ability to store huge amounts of it is not, by itself, enough to ensure that we can successfully extract business value from it.
Data is an important factor of production
In the information age, data has become as important a factor of production as capital, labor, and raw materials, and is no longer a need confined to specific industries but a general one. Companies in every industry are collecting and analyzing large amounts of data to minimize costs, improve product quality, increase productivity, and create new products. For example, analyzing data collected directly from product testing sites can help companies improve their designs, and a company can outperform its competitors by analyzing customer behavior and comparing large amounts of market data.
Storage technology must keep up
As the use of big data has exploded, it has spawned its own unique architecture, and it has also directly driven the development of storage, networking, and computing technologies. After all, dealing with the specific needs of big data is a new challenge. Hardware development is ultimately driven by software requirements, and in this case it's clear to see that big data analytics application requirements are influencing the development of data storage infrastructure.
On the other hand, this change is an opportunity for storage vendors and other IT infrastructure vendors. As the volume of structured and unstructured data continues to grow, and as the sources of analytics data become more diverse, previous storage system designs are no longer able to meet the needs of big data applications. Storage vendors have realized this, and they are beginning to modify the architectural design of block- and file-based storage systems to accommodate these new requirements. Here, we'll discuss which attributes are relevant to big data storage infrastructures and see how they are meeting the challenges of big data.
Capacity Issues
What we mean by "big capacity" here is often in the petabyte range, so big data storage systems must scale accordingly. At the same time, the expansion of the storage system must be easy, can be added through the addition of modules or disk enclosures to increase capacity, even without downtime. Based on this demand, customers are now more and more in favor of Scale-out architecture storage, Scale-out cluster structure is characterized by each node in addition to a certain amount of storage capacity, internal data processing capabilities and interconnection equipment, and the traditional storage system chimney architecture is completely different, Scale-out architecture can achieve seamless and smooth expansion, to avoid storage islands. Silos.
Big data applications involve not only huge data volumes but also a huge number of files. Managing the metadata that accumulates at the file-system layer is therefore a hard problem; handled poorly, it limits the scalability and performance of the system, and traditional NAS systems run into exactly this bottleneck. Fortunately, object-based storage architectures do not share this problem: a single system can manage billions of files without hitting the metadata-management limits of traditional storage. Object-based storage systems can also scale across wide areas, allowing them to be deployed in multiple locations and to form a large storage infrastructure that spans regions.
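As a rough illustration of why a flat, object-keyed namespace sidesteps the file-system metadata bottleneck described above, here is a small Python sketch; the class, method names, and metadata fields are assumptions made for the example, not any vendor's actual API.

```python
# Purely illustrative: an object store keeps a flat key -> (data, metadata) map,
# so lookups cost the same whether it holds a thousand objects or a billion,
# instead of walking an ever-deeper directory tree as a traditional NAS would.
import hashlib
from typing import Dict, Tuple


class ObjectStore:
    def __init__(self) -> None:
        self._objects: Dict[str, Tuple[bytes, dict]] = {}

    def put(self, data: bytes, metadata: dict) -> str:
        # The object ID is derived from the content; metadata is stored with the
        # object itself rather than in a separate, centralized file-system layer.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = (data, metadata)
        return object_id

    def get(self, object_id: str) -> Tuple[bytes, dict]:
        return self._objects[object_id]


store = ObjectStore()
oid = store.put(b"sensor reading 42", {"source": "device-7", "recorded": "2013-01-01"})
data, meta = store.get(oid)
print(oid[:12], meta)
```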
"Big Data" applications also have real-time issues. This is especially true for applications related to online transactions or financials. For example, an online advertising service in the apparel industry needs to analyze customer browsing history in real time and place advertisements accurately. This requires that the storage system must be able to support the above features while maintaining a high response time, because the result of response delay is that the system will push "outdated" advertisement content to the customer. In this scenario, the scale-out architecture of the storage system can play an advantage, because each of its nodes have processing and interconnection components, in the increase in capacity at the same time processing power can also be synchronized growth. Object-based storage systems, on the other hand, are able to support concurrent data streams, thus further increasing data throughput.
There are many "big data" application environments that require high IOPS performance, such as HPC high performance computing. In addition, the proliferation of server virtualization has led to the need for high IOPS, just as it has transformed traditional IT environments. To meet these challenges, various models of solid state storage devices have emerged, ranging from simple caching inside servers to large scalable storage systems with all solid state media, and so on are booming.
Concurrent access
Once organizations recognize the potential value of big data analytics, they bring more data sets into their systems for comparison and allow more people to share and use the data. To create more business value, enterprises tend to analyze data objects from different platforms together in an integrated way. A storage infrastructure that includes a global file system can help solve this access problem: it lets multiple users on multiple hosts concurrently access file data that may be stored on many different types of storage devices in multiple locations.
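The sketch below is a highly simplified illustration of the global-namespace idea: one logical path space is mapped onto devices at different sites, so every host resolves the same path to the same placement. The names, sites, and paths are invented for the example.

```python
# Purely illustrative: a global namespace maps one logical path space onto storage
# at different sites, so many hosts can address the same file data without knowing
# where it physically lives.
from typing import Dict, Tuple


class GlobalNamespace:
    def __init__(self) -> None:
        # logical path -> (site, backend device) placement record
        self._placement: Dict[str, Tuple[str, str]] = {}

    def register(self, path: str, site: str, device: str) -> None:
        self._placement[path] = (site, device)

    def locate(self, path: str) -> Tuple[str, str]:
        # Every host resolves the same logical path to the same placement,
        # which is what makes concurrent, location-transparent access possible.
        return self._placement[path]


ns = GlobalNamespace()
ns.register("/datasets/web-logs/2013-01", site="dc-east", device="object-pool-3")
ns.register("/datasets/sensor-archive", site="dc-west", device="tape-library-1")
print(ns.locate("/datasets/web-logs/2013-01"))
```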
Security Issues
Certain industry-specific applications, such as financial data, medical information, and government intelligence, come with their own security standards and confidentiality requirements. For IT managers these are nothing new, and compliance is mandatory. Big data analytics, however, often needs to cross-reference multiple types of data that in the past would never have been accessed together in such a mixed way, so big data applications also raise new security issues that must be considered.
Cost issues
"Big" can also mean costly. And for those organizations that are using a big data environment, cost control is a key concern. Trying to control costs means that we need to make every piece of equipment more "efficient" and reduce the number of expensive components. Technologies such as deduplication have entered the primary storage market and can now handle a wider range of data types, all of which can bring more value to big data storage applications and improve storage efficiency. In an environment where data volumes are growing, a significant return on investment can be achieved by reducing back-end storage consumption by even a few percentage points. In addition, the use of automated thin provisioning, snapshots, and cloning technologies can also improve storage efficiency. [page] Many big data storage systems include an archiving component, which is essential, especially for organizations that need to analyze historical data or need to preserve data over time. Tape remains the most economical storage medium from a cost-per-unit-capacity storage perspective, and in fact, the use of archiving systems that support terabyte-class high-capacity tapes remains the de facto standard and practice in many organizations.
The factor with the greatest impact on cost control is the use of commodity, off-the-shelf hardware. Many first-time adopters, as well as the largest-scale users, therefore build their own "hardware platforms" rather than buying packaged commercial products, an approach that helps balance their cost-containment strategies as their businesses expand. In response to this demand, more and more storage products are now offered in software-only form that can be installed directly on a user's existing, generic, or off-the-shelf hardware. In addition, many storage software companies sell their software as the core of integrated hardware-and-software appliances, or partner with hardware vendors to launch joint products.
Accumulation of data
Many big data applications involve regulatory compliance, which often requires data to be kept for years or decades. Medical records, for example, are usually retained for the lifetime of the patient, while financial records are typically kept for seven years. Some big data storage users also want data kept even longer, because any data is part of the historical record and most analysis is done over time periods. Long-term retention requires storage vendors to provide continuous data-consistency checking and other features that ensure long-term high availability, as well as the ability to update data in place.
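As an illustration of what continuous consistency checking can look like in its simplest form, the sketch below records checksums for retained files and later re-verifies them to detect silent corruption; the function names and checksum choice are assumptions made for the example.

```python
# Purely illustrative: a periodic fixity (consistency) check for long-retention data.
# Checksums are recorded once and re-verified later, so silent corruption is noticed
# while another copy can still repair it.
import hashlib
from pathlib import Path
from typing import Dict, List


def record_checksums(files: List[Path]) -> Dict[str, str]:
    """Record a SHA-256 checksum for each retained file."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in files}


def verify_checksums(expected: Dict[str, str]) -> List[str]:
    """Return the paths whose current content no longer matches the recorded checksum."""
    damaged = []
    for path, digest in expected.items():
        current = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if current != digest:
            damaged.append(path)
    return damaged


# Example run (writes a throwaway file in the current directory):
sample = Path("retention-sample.dat")
sample.write_bytes(b"keep me for seven years")
baseline = record_checksums([sample])
print(verify_checksums(baseline))   # [] while the data is still intact
```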
Flexibility
Big data storage systems are often very large, so they must be carefully designed to ensure that the storage infrastructure can grow and expand flexibly along with the analytics applications. In a big data storage environment, data migration is no longer necessary, because data is kept at multiple deployment sites simultaneously. Once a big data storage infrastructure is up and running it is very difficult to change, so it must be able, from the start, to accommodate a variety of application types and data scenarios.
Application-awareness
The earliest big data users developed infrastructure customized for their applications, such as systems built for government programs and the dedicated servers created by large ISPs. Application-aware technology is becoming more common in the mainstream storage market and is an important means of improving system efficiency and performance, so it should be used in big data storage environments as well.
What about small users?
It's not just those particular large user groups that rely on big data; as a business requirement, small businesses are bound to apply big data in the future as well. We have seen that some storage vendors are already developing small "big data" storage systems that appeal to cost-sensitive users.