Developing Pragmatic Data Pipelines using Apache Airflow on Google Cloud Platform
Sameer Shukla
Section: Research Paper, Product Type: Journal Paper
Volume-10, Issue-8, Page no. 1-8, Aug-2022
CrossRef-DOI: https://doi.org/10.26438/ijcse/v10i8.18
Online published on Aug 31, 2022
Copyright © Sameer Shukla. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
How to Cite this Paper
IEEE Style Citation: Sameer Shukla, “Developing Pragmatic Data Pipelines using Apache Airflow on Google Cloud Platform,” International Journal of Computer Sciences and Engineering, Vol.10, Issue.8, pp.1-8, 2022.
MLA Style Citation: Sameer Shukla "Developing Pragmatic Data Pipelines using Apache Airflow on Google Cloud Platform." International Journal of Computer Sciences and Engineering 10.8 (2022): 1-8.
APA Style Citation: Sameer Shukla, (2022). Developing Pragmatic Data Pipelines using Apache Airflow on Google Cloud Platform. International Journal of Computer Sciences and Engineering, 10(8), 1-8.
BibTex Style Citation:
@article{Shukla_2022,
author = {Sameer Shukla},
title = {Developing Pragmatic Data Pipelines using Apache Airflow on Google Cloud Platform},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {8 2022},
volume = {10},
Issue = {8},
month = {8},
year = {2022},
issn = {2347-2693},
pages = {1-8},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=5509},
doi = {https://doi.org/10.26438/ijcse/v10i8.18},
publisher = {IJCSE, Indore, INDIA},
}
RIS Style Citation:
TY - JOUR
DO - https://doi.org/10.26438/ijcse/v10i8.18
UR - https://www.ijcseonline.org/full_paper_view.php?paper_id=5509
TI - Developing Pragmatic Data Pipelines using Apache Airflow on Google Cloud Platform
T2 - International Journal of Computer Sciences and Engineering
AU - Sameer Shukla
PY - 2022
DA - 2022/08/31
PB - IJCSE, Indore, INDIA
SP - 1-8
IS - 8
VL - 10
SN - 2347-2693
ER -
Abstract
A data pipeline [1][2] is a series of actions that moves data from a source to a destination; its complexity varies from use case to use case. A traditional data pipeline cleans the data, aggregates it, and moves it from one place to another. This sounds simple, but it is complex in practice: organizations deal with large and intricate datasets, and a pipeline is expected to be robust and fast, to report its status, and to perform the same task repeatedly without failing. Modern data pipelines differ somewhat: they are expected to handle petabytes of data, store it in various cloud services, and provide real-time data analysis. Apache Airflow is one tool that considerably simplifies data pipeline creation, and its only prerequisite is basic knowledge of Python. This paper focuses on creating a stock-exchange data pipeline using Airflow concepts such as DAGs and Operators.
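The clean, aggregate, and move steps described above, and the idea of ordering tasks as a directed acyclic graph (DAG), can be illustrated with a minimal sketch. It is not taken from the paper and does not use the Airflow API: it relies only on the Python standard library (`graphlib`, Python 3.9+), and the task names, the `RAW_ROWS` sample data, and the `run_pipeline` helper are all hypothetical.

```python
from graphlib import TopologicalSorter
from statistics import mean

# Hypothetical sample rows standing in for stock-exchange data.
RAW_ROWS = [
    {"symbol": "AAA", "close": "10.0"},
    {"symbol": "AAA", "close": "12.0"},
    {"symbol": "BBB", "close": ""},  # missing value, dropped by clean()
    {"symbol": "BBB", "close": "20.0"},
]


def extract(ctx):
    # Read from the source (here, an in-memory list).
    ctx["raw"] = list(RAW_ROWS)


def clean(ctx):
    # Drop rows with an empty closing price and convert prices to float.
    ctx["clean"] = [
        {"symbol": row["symbol"], "close": float(row["close"])}
        for row in ctx["raw"]
        if row["close"]
    ]


def aggregate(ctx):
    # Average closing price per symbol.
    by_symbol = {}
    for row in ctx["clean"]:
        by_symbol.setdefault(row["symbol"], []).append(row["close"])
    ctx["avg_close"] = {sym: mean(vals) for sym, vals in by_symbol.items()}


def load(ctx):
    # Move the result to the destination (here, another dict entry).
    ctx["destination"] = dict(ctx["avg_close"])


TASKS = {"extract": extract, "clean": clean, "aggregate": aggregate, "load": load}

# Each task maps to the set of tasks that must finish before it runs.
DEPENDENCIES = {
    "clean": {"extract"},
    "aggregate": {"clean"},
    "load": {"aggregate"},
}


def run_pipeline():
    ctx = {}
    # A topological order of the DAG guarantees upstream tasks run first.
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](ctx)
    return order, ctx["destination"]


if __name__ == "__main__":
    print(run_pipeline())
```

In Airflow, each function would typically be wrapped in an operator and the dependency map expressed as task relationships inside a DAG definition; scheduling, retries, and status notification are then handled by Airflow rather than by a hand-written loop like the one in this sketch.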
Keywords / Index Terms
Data-Pipeline, Python, Pandas, Seaborn, Apache-Airflow, GCP, Kaggle.
References
[1] P. Covington, J. Adams and E. Sargin, "Deep neural networks for youtube recommendations", Proceedings of the 10th ACM conference on recommender systems, pp. 191-198, 2016.
[2] H. H. Olsson and J. Bosch, "From opinions to data-driven software R&D: a multi-case study on how to close the 'open loop' problem", 2014 40th EUROMICRO Conference on Software Engineering and Advanced Applications, pp. 9-16, 2014.
[3] Panos Vassiliadis, ‘A Survey of Extract-Transform-Load Technology’, International Journal of Data Warehousing and Mining, 5:1-27, July 2009.
[4] Tziovara, V., Vassiliadis, P., & Simitsis, A. (2007). Deciding the Physical Implementation of ETL Workflows. Proceedings ACM 10th International Workshop on Data Warehousing and OLAP (DOLAP 2007), pp. 49-56, Lisbon, Portugal, 9 November 2007.
[5] Vassiliadis, P., & Simitsis, A. (2009). Extraction-Transformation-Loading. In Encyclopedia of Database Systems, L. Liu, T.M. Özsu (eds), Springer, 2009.
[6] Florian Waas, Tobias Freudenreich, Robert Wrembel, Maik Thiele, Christian Koncilia, Pedro Furtado, ‘On-Demand ELT Architecture for Right-Time BI: Extending the Vision’, International Journal of Data Warehousing and Mining, 9(2):21-38, April 2013.
[7] Fabian Prasser, Helmut Spengler, Raffael Bild, Johanna Eicher, Klaus A. Kuhn, ‘Privacy-enhancing ETL-processes for biomedical data’, International Journal of Medical Informatics, Vol.126, pp.72-81, June 2019.
[8] Ibrahim Burak Ozyurt and Jeffrey S Grethe, ‘Foundry: a message-oriented, horizontally scalable ETL system for scientific data integration and enhancement’, Database (Oxford). 2018; 2018: bay130. C. Wohlin, P. Runeson, M. Höst, M. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering. Computer Science. Springer, 2012.
[9] Venters, W., Whitley, E.A.: A Critical Review of Cloud Computing: Researching Desires and Realities. J. Inf. Technol. 27, 179–197, 2012.
[10] Justin, C., Ivan, B., Arvind, K. and Tom, A. “Seattle: A Platform for Educational Cloud Computing”, SIGCSE’09, March 3-7, 2009, Chattanooga, Tennessee, USA. 2009.
[11] Google Apps Education Edition: communication, collaboration, and security in the cloud. http://www.google.com/a/edu/
[12] Matei Zaharia, Reynold S Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J Franklin, et al. Apache Spark: a unified engine for big data processing. Communications of the ACM, 59(11):56–65, 2016.
[13] Creating Data Pipelines using Apache Airflow "Sameer Shukla" Volume 9 - Issue 4 International Journal of Computer Techniques (IJCT) ,ISSN:2394-2231 , www.ijctjournal.org
[14] S. Fortune, J. Hopcroft, J. Wyllie The directed subgraph homeomorphism problem Theoret. Comput. Sci., 10, pp. 111-121, 1980.
[15] C.L. Lucchesi, M.C.M.T. Giglio, On the irrelevance of edge orientations on the acyclic directed two disjoint paths problem, IC Technical Report DCC-92-03, Universidade Estadual de Campinas, Instituto de Computação, 1992.
[16] Y. Perl, Y. Shiloach Finding two disjoint paths between two pairs of vertices in a graph J. ACM, 25, pp. 1-9, 1978.
[17] R. Agrawal and R. Srikant, "Mining Sequential Patterns", Proc. Int'l Conf. Data Eng. (ICDE '95), pp. 3-14, 1995.
[18] J. Chen and K. Xiao, "BISC: A Binary Itemset Support Counting Approach Towards Efficient Frequent Itemset Mining", ACM Trans. Knowledge Discovery in Data.
[19] Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M., & Skiadopoulos, S. (2005). A generic and customizable framework for the design of ETL scenarios. Information Systems, 30, 7, 492-525, 2005.
[20] P. Merle, O. Barais, J. Parpaillon, N. Plouzeau and S. Tata, "A Precise Metamodel for Open Cloud Computing Interface", the 8th International Conference on Cloud Computing (CLOUD). IEEE, pp. 852-859, 2015.
[21] D. C. Schmidt, "Model-Driven Engineering", COMPUTER-IEEE COMPUTER SOCIETY-, vol. 39, no. 2, pp. 25, 2006.
[22] Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing.
[23] Zimmermann, O. (2009). An architectural decision modeling framework for service oriented architecture design. PhD thesis, Universität Stuttgart.
[24] Badidi, E. (2013) “A Framework for Software-As-A-Service Selection and Provisioning”. In: International Journal of Computer Networks & Communications (IJCNC), 5(3): 189-200, 2013.
[25] F. Montesi and J. Weber, “Circuit Breakers, Discovery, and API Gateways in Microservices,” arXiv:1609.05830 [cs], Sep. 2016.
[26] G. Grahne and J. Zhu, "Efficiently Using Prefix-Trees in Mining Frequent Itemsets", Proc. Workshop Frequent Itemset Mining Implementations (FIMI '03), 2003.
[27] Z. Zhang and M. Kitsuregawa, "LAPIN-SPAM: An Improved Algorithm for Mining Sequential Pattern", Proc. Int'l Special Workshop Databases for Next Generation Researchers, pp. 8-11, 2005.