stackArmor was contracted to modernize Amtrak’s Sales Data Repository (SDR) using Amazon Web Services (AWS). The work performed included designing and implementing a data lake and processing platform integrating IBM mainframe and heterogeneous data sources.
The Data Modeling and implementation exercise began with creating a DDL (Data Definition Language) schema using a tool such as Toad Edge or ERWin that can be deployed in the AWS Aurora cluster for Amtrak. The actual implementation phase required coordinating with Amtrak’s infrastructure team to help create a development environment on AWS. Key elements to developing the environment included the following steps:
- AWS environment setup including a VPC, Access Gateway, and connectivity to the VPC from the Amtrak network.
- Code repository and build process using in-built services such as AWS CodeDeploy, CodeBuild, and CodeCommit.
- AWS Data Lake to develop the core data catalog and data ingest capability including configuration of S3 buckets for deploying and staging in-bound datasets.
- Data model in AWS Aurora and Redshift Deliver logical and physical data model for SDR (OLTP) and EDW (Redshift (OLAP)) to store Ticket, Reservation, Sales, and Train Schedule related data.
- Configuring the AWS Aurora cluster based on the required performance requirements.
- For example, since read/write speeds are critical for Amtrak’s SDR, a multi-read replica approach with caching can help alleviate performance and response times.
- A data archival policy is configuration using Amazon Machine Images (AMI), Relational Database Services (RDS), and Elastic Block Store (EBS), with snapshots stored on S3. Based on an agreed upon aging cycle, the data is migrated to AWS Glacier from the S3 buckets.
Data Integration and ETL:
As part of the development effort, the data integration, and loading activities, we will perform the following work activities as described below.
- Configured and setup the AWS Glue (ETL) jobs profile and data catalog for ingesting the input files. These include daily batch file integration process using Glue service and load in Aurora and Redshift/RTDS/SAP GL.
- Setup the AWS Kinesis service to deliver real time data integration process from Arrow MQ to SDR and S3. All of the raw transactional data will be logged maintained in the Amtrak Data Lake for tagging and reprocessing if needed.
- Wrote AWS Lambda functions to help trigger the various transformation and ETL jobs.
- Developed data artifacts for reconciliation, QA, and testing purposes covering the following real time data integration process using AWS service for read and write.