Big data refers to massive volumes of data that cannot be stored and/or processed within a specific time frame using a simple Database Management System approach. The term usually covers data measured in petabytes or more (exabytes, zettabytes and so on), a scale that makes storing, analysing and visualizing the data difficult. Its volume outstrips the tools available to store or even process it. This data is typically non-transactional and can be user generated or machine generated.
Big data architecture, the foundation for big data analytics, results from the interaction of big data application tools. These tools and database technologies are combined to achieve high performance, high fault tolerance and scalability. The architecture depends on the tools an organization already has in place and on its data environment.
A big data architecture is designed to handle the ingestion, processing and analysis of data that is too large or complex for traditional database systems. Big data solutions usually involve one or more of the following workloads: batch processing of big data sources at rest, real-time processing of big data in motion, interactive exploration of big data, and predictive analytics and machine learning.
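The difference between processing data at rest and data in motion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `EVENTS` sample and the function names `batch_totals` and `streaming_totals` are hypothetical and do not come from any particular big data tool.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (device_id, reading); hypothetical event shape

# Hypothetical sample data standing in for a much larger source.
EVENTS = [("sensor-a", 3), ("sensor-b", 5), ("sensor-a", 4)]


def batch_totals(events: Iterable[Event]) -> Dict[str, int]:
    """Batch (data at rest): the full data set is available up front,
    so it is processed in one job and only the final result is returned."""
    totals: Counter = Counter()
    for device, reading in events:
        totals[device] += reading
    return dict(totals)


def streaming_totals(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Streaming (data in motion): events arrive one at a time, and an
    updated running result is emitted after each event."""
    totals: Counter = Counter()
    for device, reading in events:
        totals[device] += reading
        yield dict(totals)  # snapshot of the state so far


if __name__ == "__main__":
    print(batch_totals(EVENTS))
    for snapshot in streaming_totals(EVENTS):
        print(snapshot)
```

Both functions arrive at the same totals. The batch version answers once, after reading everything, while the streaming version has an answer available after every event.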
Most big data architectures include some or all of the following components:
°Data sources: A solution can have one or more, e.g. static data stores (relational databases), static files produced by applications (web server log files) and real-time data sources (IoT devices).
°Data storage: Data for batch processing operations is usually stored in a distributed file store, known as a data lake, that can hold high volumes of data files in different formats, e.g. the Azure Data Lake Store from Microsoft.
°Batch processing: Because the data sets are very large, a big data solution must often process data using long-running jobs to filter, aggregate and prepare data for analysis. This usually involves reading source files, processing them and writing output to new files.
°Real-time message ingestion: If the solution involves real-time sources, the architecture must include a way to capture and store real-time messages for stream processing.
°Analytical data store: The solution should prepare data for analysis and then serve the processed data in a structured format that can be queried with analytical tools. A Kimball-style relational data warehouse can serve as the data store for these queries.
°Orchestration: Big data workflows range from repeated data processing operations to loading the processed data into an analytical data store and eventually pushing the results directly into a report or dashboard. An orchestration technology such as Azure Data Factory, or Apache Oozie and Sqoop, can be used to automate them.
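As a rough sketch of how the batch processing, analytical data store and orchestration components fit together, the Python example below runs a tiny pipeline on a single machine. Everything in it is an assumption made for illustration: the CSV log layout, the `page_hits` table, the function names, and the use of an in-memory SQLite database as a stand-in for a real data warehouse. A production system would rely on a distributed file store and a dedicated orchestration tool instead.

```python
import csv
import sqlite3
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical web server log rows: (path, HTTP status).
SAMPLE_LOG = [("/home", "200"), ("/home", "200"),
              ("/about", "200"), ("/missing", "404")]


def batch_job(source: Path, output: Path) -> None:
    """Batch step: read a source file, filter and aggregate, write a new file."""
    hits = defaultdict(int)
    with source.open(newline="") as f:
        for row in csv.DictReader(f):
            if row["status"] == "200":   # filter: keep successful requests
                hits[row["path"]] += 1   # aggregate: count hits per path
    with output.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "hits"])
        writer.writerows(sorted(hits.items()))


def load_job(output: Path, conn: sqlite3.Connection) -> None:
    """Load step: copy the batch output into a structured, queryable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_hits (path TEXT PRIMARY KEY, hits INTEGER)"
    )
    with output.open(newline="") as f:
        rows = [(r["path"], int(r["hits"])) for r in csv.DictReader(f)]
    conn.executemany("INSERT OR REPLACE INTO page_hits VALUES (?, ?)", rows)
    conn.commit()


def report_job(conn: sqlite3.Connection) -> list:
    """Report step: query the analytical store, as a dashboard would."""
    return conn.execute(
        "SELECT path, hits FROM page_hits ORDER BY hits DESC, path"
    ).fetchall()


def run_pipeline(log_rows) -> list:
    """Orchestration: run the steps in order, each feeding the next."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "access_log.csv"
        output = Path(tmp) / "hits_by_path.csv"
        with source.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["path", "status"])
            writer.writerows(log_rows)
        conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
        try:
            batch_job(source, output)
            load_job(output, conn)
            return report_job(conn)
        finally:
            conn.close()


if __name__ == "__main__":
    for path, hits in run_pipeline(SAMPLE_LOG):
        print(path, hits)
```

The orchestration here is just a fixed sequence of function calls. Dedicated tools add what this sketch leaves out, such as scheduling, retries and dependency tracking between steps.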