Search Results | Pingahla

Blog Posts (34)

Other Pages (53)

34 results found with an empty search

What is Talend CDC? - Demonstration of a Standard Replication
Talend CDC (Change Data Capture) is a tool that performs a data integration process that replicates and syncs data from one source to different targets in real-time. In a lot of business scenarios, it is required to keep databases synchronized, and there are many of those use cases that can be solved with Talend CDC, such as: Keeping the DWH aligned with the latest data from transactional sources Hybrid architecture needs to keep the cloud environment aligned with the on-premises data environment A very performance-consuming replication system that needs to perform complex queries on specific columns of our data to find out if it is a new, changed, or deleted record. A transactional database is used as a reporting source, thus affecting its performance. Having an online backup Having a distributed database CDC can be used in different environments: on-premises, hybrid, or cloud, and it has connectors to different cloud providers such as Snowflake, AWS, or Azure, and managed services such as Kafka, AWS kinesis or Azure Eventhub. It also has connectors and support for on-premises servers such as IBM AS400, which is frequently used by financial entities and makes it unique in the market. How does Talend CDC work? CDC identifies the different changes that are in the source and replicates them in the target source. The replicated data comes from the different operations carried out such as: INSERT DELETE UPDATE The data sources that are producing changes can store logs with the transaction events that have been carried out on the data. These logs can be in different formats, varying according to the manufacturer. The CDC agent works by monitoring and collecting those transactional events once, to store them in its own log files (Journal), in case it’s necessary to reuse them in different replication. However, this process is oblivious to system resources and does not interact with them. In cases where we already have a database created and populated as a source, we will need to only do one full replication. From that point forward, Talend CDC will capture the changes made. It’s possible to apply scripts to the data in the source to select, transform or perform aggregations as well as apply changes to the target prior to integration. Its configuration is done in a graphical console in which we select our source and target. It operates with different roles as: Administrator Operator Viewer The replicas are created in an environment, and there will be called models; a model contains the sources and targets; Once configured, we can add additional options for these, such as scheduling the executions, changing the framework with which we integrate, or supervising the job through logs in the tool or log files. There are different frameworks with which CDC can do data replications, and each of them has additional capabilities, such as adding columns in the target, providing transformations, adding sequence identifiers, or even working with big data targets. Demo of a Standard Replication Initial Settings In this blog, I will show a standard replica, so we will select the framework as “Free”, which means that no additional columns or data will be added to what is already in the source; I will be using Talend Change Data Capture version V 7.15 .0. For this example, I have already created the source and target databases in SQL Server, which I named "training_source" and "training_target" respectively; in the source, I have a table called "students" while in the target, I have not created anything yet. Likewise, the configuration of Talend CDC towards the source and the different components necessary for the correct functioning of the tool are already installed, these are: Capture Engine Source Engine Target Engine Talend CDC Manager 32-bit ODBC connector for SQL Server in the DSN (Data Source Administrator) SQL Server Let's start: We start by configuring the source in the File -> Source subscribe menu and fill in the data according to our source configuration: Name (Alias): The name that you prefer IP address: Is the location of CDC source engine where the DB instance connection has been defined, this setup was done during the CDC installation and configuration Source type: The engine of the source DB Instance or Server name: The one configured on your server Then we notice that in the source it will show us "Training Source" with a green button that will help us to connect to it: When we connect to that source, we will see that the different configured environments where we can create the models are shown, for this example, we will use Training as our environment: Now we configure the journal, so we go to Source -> Journal Management Right click on the created DB "training_source" and choose the option “Start DB logging process” Then we can select the directory in which we want to store the journal and press Ok. Still in Journal Management, we have to select the table we want to monitor and right click on it and then, press the option “Start table logging process for”: Note: A primary key must be configured in the table to start the monitoring process on this. After starting the monitoring process on the "students" table, we need to verify that the journal is working, for this we are going to make a slight change on one of the records on this table: Note: We have changed John Doe's city to Miami instead of New York Now again in the Journal Management right click on “training_source” and “Display Journal”: We will see the only receiver available, which is the one written for the change we just made: We select it and go to the "Posts" tab and from there we will see the changes that were made: Note: The change of city is shown, the first record is "New York" and the second is "Miami" which guarantees the correct functioning of the journal. Creating the Replication: The first step is to create a configuration space within an environment; each replica within the same environment shares the same parameters and can be managed together. In the "Environment" tab, click on "Add" and fill in the fields like the following example: Environment: unique code of the environment to be created Source code: Unique code for the source, must be 3 characters exactly Environment type: For this example it’s “Training” Description: It’s recommended to add a description Click on “Add” and we will have the environment created: Note: The environment contains a tab for the source where the model is created and another tab for the target. Additionally, three tabs where the replication properties will be displayed, and graphic elements of the replica to visualize the process. Now from the “Source” tab right click on “Models” and then on “Add”: And configure the model as follows: Note: Two additional tabs “Table Options” and “Script” are shown, for which we will use the configuration set by default. Model: A unique code for the model Description: Description of the model Type: For this replication, we use JOURNAL, however, there is EXTRACT that allows a full extraction Click on “Add” and the model will be available in the “Models” folder in the “Source” tab of the “Environment”: The next step is to add Tables to the model, expanding the Model on the [+] button, then right-clicking on “Tables” and finally clicking on “Add” A window will appear to select the tables that we will add to the model: Note: As you can see, we are using the "Free" framework that doesn’t generate additional changes to our data With the "Query" button we can list all the available tables Select the tables, in this case "students" Move the selected table to the model Finish The table will appear created inside our model: The next step is to create the target, from the "Target" tab of our environment we right click on "Targets" and then "Add": The window for the new Target will open: Target: unique code for the target Description: Description of the target Target type: default value Name: Name of the instance of our target database Address: address of the server where the target database is located Exclude: it’s filled automatically as soon as we put the name of the instance Then click on “Add”, and we will see our created target: The next step is to create a distribution which means associating our model to a target, we can do it from the "Source" or "Target" tab; expand our created target or source, and click on “Distribute”: And we move the Model from the left panel to the right and click on “Ok”: A window opens automatically to finish the configuration of the distribution: Select the type of database Double-click on the connector A window will open to enter the data from our target database Click on “Ok” to add it If the connection is correct, a notification window will show us "Connection Successful" and will show us the databases available there, and finally click on "Add": We can now see from the "Map" tab that the target T01 is associated with the model M01: The next step is to create the target table, then from the "Target" tab we click on the "M01" model, then select the table to replicate, right-click on it, and finally "Create Target Table": After this we will see that a new window appears showing the DDL of the new table using the target database: Click on “Execute” and then we check in the target database to show that the table was created there, but it is empty: Run Replication: Now that the target, the model and the distribution are created, we must execute the replication; from the “Map” tab, right click on Model M01 and then click on “Properties”: From the "Recovery" tab we select the "Load" check box Then we go to the "Activity" tab, a confirmation window will ask us if we want to reload all the tables in that distribution, to which we will say "Yes". Now from the "Activity" tab click on the "Start" button After starting the replication, the fields of the "Counters" and "Last operation" section will show us the results and likewise, the distribution will remain active and waiting for any change that occurs in the source to replicate it automatically: Note: As shown, we had 3 records in the source and those same ones were selected, sent and added to the target. When the model is active, it will be green, otherwise it will be yellow, or it can be red if errors are encountered during execution: Now it must be checked in the target table to show that the records were replicated: Finally, we stop the replication by right-clicking on the model that appears in green and clicking on “Stop”. Talend CDC offers many more features for real-time replication. In the previous example, we perform a simple replication with the same source and target fields, as well as a manual execution since they can also be scheduled. In the next posts, I will show other features or functions that Talend CDC works with, as well as highlighting the differences with other frameworks available there. Fredy Antonio Espitia Castillo Talend Developer Certified https://www.linkedin.com/in/fredy199601/
Recap - NYC 2023 Snowflake Data for Breakfast
Last week on March 9, 2023, I attended the NYC Snowflake Data for Breakfast event. The event was packed with attendees from various industries and featured a lineup of great speakers who shared their experiences and insights on using Snowflake's platform. Also, to make things better, Snowflake had some great food and coffee flowing, which was needed since the event started at 8 am EST. Now the event kicked off with Brent Bateman, a Principal Sales Engineer at Snowflake, who discussed the capabilities of Snowflake, with a focus on its self-managed automated administration and unified governance platform. He highlighted how Snowflake is helping companies easily manage and access their data with less complexity and more agility. A lot of us know this today, but it was a good way to get the room to wake up and excited for the morning event. One thing Brent did not discuss though was the recent acquisition with Streamlit, but we all know there will be more to come on this in the future with Snowflake. Next, Jonathan Hyman, Co-Founder & CTO at Braze, shared how they use Snowflake to gain better insights into their customers and build more personalized experiences. He emphasized the importance of data-driven decision-making in today's business landscape and how Snowflake is an essential tool for achieving that goal. In addition, it was a breath of fresh air as Jonathan also walked us through at a high-level his current architecture and their customer journey. The event also featured a panel discussion with Jai Subrahmanyam, Senior Vice President, Head of Data Governance at The Blackstone Group, and Phil Andriyevsky, Partner- Wealth and Asset Management at EY. They shared how Snowflake has helped The Blackstone Group in leveraging data for better decision-making and driving business outcomes. They also discussed the value Snowflake has brought to their respective organizations in terms of scalability, performance, and cost savings. The event concluded with all speakers coming together to answer questions from the audience. Attendees had the opportunity to learn from Snowflake experts and industry leaders and gained valuable insights on how Snowflake can help their organizations effectively manage and utilize their data. Overall, this was a great event and extremely informative and insightful. Also, I look forward to exploring Snowflake capabilities and how they are being used with their current customers. Also, thanks again to the Snowflake Partner team for inviting Pingahla Colombia and Pingahla NorAm team to all the events being hosted around the world! Next up Snowflake Summit on June 26th! I can't wait to be in attendance.
Snowflake + Streamlit (Gamechanger!)
As this week draws the Data for Breakfast events with our Snowflake partner. I thought it would be only right to put together a blog post on the latest news about Snowflake's Streamlit acquisition. When I saw this acquisition, I thought this was a great job by Snowflake! But I’m still waiting on Snowflake to purchase dbt...😊 Now some of you may not have heard about Streamlit. Now for those who have only heard about Steamlit now, let me share with you a quick blurb about the company that you can easily find on the web. Streamlit is an open-source framework that allows data scientists and machine learning engineers to build web applications for their data projects. With Streamlit, users can create interactive and customizable web interfaces for their machine-learning models, data visualizations, and data exploration tools without needing to know how to write front-end web code. Now, what is the value of Streamlit to organizations and companies like yours? Well, their software allows the ability to simplify and streamline the development process. From what I have read and seen on YouTube. Using Streamlit, companies can build and deploy interactive applications faster and with fewer resources than with traditional dev methods. Now the kicker for the solution's value is that it can lead to faster time-to-market for new products, and solutions, improved collaboration between data engineers, data scientists, and business users, and greater agility in responding to changing market needs. To me, this is a game-changer! Now if it can do all of this why would Snowflake would want to purchase Streamlit? Now I do not sit on the Snowflake board, work for the company, etc. I have very little insight on the WHY, but being a technologist, I would believe the reason Snowflake purchased Streamlit is due to many of the benefits I highlighted above which include; Allowing Snowflake customers to quickly build and deploy web applications on top of their data. This could help Snowflake differentiate its offering from competitors and provide additional value to its customers. Snowflake's expertise in data management and analytics could help streamline Streamlit's development process and provide access to a larger customer base. This could help accelerate Streamlit's growth and adoption in the market. The acquisition could potentially lead to synergies in research and development, with both companies sharing knowledge and expertise in the areas of data management, analytics, and web application development Either way, this is a win, win for Snowflake customers and partners like Pingahla. I am excited to be attending the upcoming NYC Snowflake Data for Breakfast and I hope this is a topic that will be discussed. Snowflake + Streamlit News: https://investors.snowflake.com/news/news-details/2022/Snowflake-Announces-Intent-to-Acquire-Streamlit-to-Empower-Developers-and-Data-Scientists-to-Mobilize-the-Worlds-Data/default.aspx
The Importance of Data Governance & Data Security in 2023 and onwards!
In today's data-driven world, data has become one of the most valuable assets for businesses. However, the increasing amount of data being generated and processed also poses significant risks to data security, privacy and data regulations. Therefore, it has become essential for businesses to implement robust data governance and data security practices to mitigate these risks. Now I understand, not all organizations have a clear data governance strategy or org in place, but I am seeing more and more of my customers make this investment due to data regulations such as CCPA, GDPR, and Virginia CDPA to name a few. However, business and development teams often face significant challenges while working with data governance and data security teams. These teams can slow down the development process, making it difficult for business and development teams to deliver new features and functionality on time. This delay can cause frustration for the business users, developers and may impact the overall success of the project. Despite these challenges, it's crucial for the business and development teams to work closely with data governance and data security teams. These teams are responsible for implementing the policies, procedures, and technologies that mitigate security risks, ensure data compliance, and protect sensitive information. Here are some reasons why: Mitigating security risks: Data governance and data security teams are responsible for identifying and mitigating security risks. These teams establish security policies and procedures, perform security assessments, and implement security controls to protect data from unauthorized access, theft, or misuse. By working closely with these teams, developers can ensure that the applications they develop are designed and implemented with security in mind, reducing the risk of security breaches. Ensuring data compliance: Data governance and data security teams are also responsible for ensuring that the organization's data practices comply with relevant laws and regulations, such as CCPA, Virginia CDPA, GDPR or HIPAA. These regulations have strict requirements for data protection, data privacy, and data access, which can be challenging for development teams to navigate. By working closely with data governance and data security teams, developers can ensure that their applications comply with these regulations, reducing the risk of non-compliance and potential legal and financial consequences. Protecting sensitive information: Data governance and data security teams are responsible for protecting sensitive information, such as personal information, financial information, and intellectual property. By working closely with these teams, developers can ensure that the applications they develop are designed and implemented with data protection in mind, reducing the risk of sensitive information being exposed or misused. Now, let's take a look at some examples of companies that have faced consequences for poor data governance or data security practices: Facebook: In 2018, Facebook was embroiled in a massive data privacy scandal involving the data analytics firm Cambridge Analytica. It was discovered that Cambridge Analytica had obtained data from millions of Facebook users without their consent, which was used to influence political campaigns. This scandal led to a massive loss of trust in Facebook, and the company faced intense scrutiny from regulators and lawmakers around the world. Equifax: In 2017, credit reporting agency Equifax suffered a massive data breach that exposed the personal information of approximately 143 million people. The breach was caused by a vulnerability in Equifax's web application framework, which had not been patched despite a security alert being issued months earlier. The company faced intense criticism for its poor data security practices, and it ultimately had to pay a $700 million settlement to consumers, regulators, and lawmakers. Citibank: In 2020, Citibank was fined over $400M due to risk management and data governance issues. The bank had failed to implement proper risk management practices and internal controls, which led to a massive error in its payments system. The error resulted in Citibank accidentally transferring nearly $900 million to a group of creditors of Revlon, the cosmetic company. Citibank was unable to retrieve the funds, and it ultimately had to absorb the loss. The incident highlighted the importance of robust risk management and data governance practices, as failure to implement these practices can result in significant financial losses and regulatory consequences. These examples illustrates how poor data governance and risk management practices can have severe consequences for organizations like these three examples. It's essential for business and development teams to work closely with data governance and security teams to ensure that the applications and solutions they develop are designed and implemented with these practices in mind. By doing so, organizations can reduce the risk of costly errors, data breaches, and regulatory fines. In conclusion, while working with data governance and data security teams can be challenging, it's essential for development teams to prioritize these practices. By doing so, business and developers can ensure that the applications they develop are designed and implemented with security and compliance in mind, reducing the risk of security breaches, non-compliance, and potential legal and financial consequences. The mind set that these two parts of the organization are different should not exist. Think of these two parts of the org as an extended part of your team wanting you on the business or delivery team to succeed while making sure the proper governance and security is in place!
Snowflake vs DataBricks
At Pingahla we work with customers who are at different stages of their cloud journey. A common theme that we see is the selection of the right cloud data warehousing and analytics platform. This becomes especially important as the wrong choice may seriously hinder the speed and agility of digital transformation and the company's ability to respond to changing market dynamics. In this post we will evaluate and compare two major players in this space - Snowflake and DataBricks DataBricks and Snowflake are both cloud-based data warehousing and analytics platforms. While they share some similarities, there are also some key differences between the two. One major difference is that DataBricks is a unified analytics platform that combines data engineering, data science, and business analytics in a single environment, while Snowflake is a cloud-based data warehousing solution that focuses specifically on data storage and querying. Another difference is that DataBricks uses an in-memory computing model, which allows for fast processing of large amounts of data, while Snowflake uses a columnar storage model and separates compute and storage resources, which can make it more scalable and cost-effective for certain use cases. Additionally, DataBricks offers a range of integrated tools and services for data engineering, machine learning, and business intelligence, while Snowflake integrates with a variety of third-party tools and services for data integration, visualization, and analytics. Overall, the choice between DataBricks and Snowflake will depend on the specific needs and goals of an organization. DataBricks may be a better fit for organizations that need a full-featured analytics platform with integrated tools and services, while Snowflake may be a better fit for organizations that have specific data warehousing requirements and want a scalable and cost-effective solution. At Pingahla, we help solve complex data challenges for our customers. We are experts in building cloud based data warehousing and analytics solutions. To know more, please reach out at info@pingahla.com
How To Upload Files to Amazon S3 Bucket Using Talend Studio
Our instructional step by step video by our #Pingahla certified #Talend Kuldeep Singh will show you how to create an ETL data pipeline using Talend Studio and its #Amazon #S3 bucket connector. Below are the steps that will show you how to build out such a data pipeline, else you can scroll to the bottom of this blog post and check out our YouTube instructional video on the following topic. Open Talend Studio. Using the Local or select a Connection. In this example, we will select Local and select an existing project, "Local_Project." You can do the same or select any other project where you want to create the data pipeline. Once the Talend Studio has started, make sure you are in the Integration Perspective, if it is not already selected. You can also change the perspective by going to Window > Perspective > Integration. This can also be done from the taskbar by clicking on Integration as shown in the picture below: Create a new Job by right-clicking on Repository > Job Designs, then selecting Create Standard Job. Enter the Job Details and click on finish when done. Always put in a description for the job, this will help others to understand the purpose of the job in a collaborative environment. Now we will start adding the components from the Palette to the Job. For the purpose of the guide, we will generate data using the tRowGenerator component available in Talend Studio Palette. The generated data will be written in a delimited file. The generated file will then be placed on an S3 bucket. First, we will start creating a file with dummy data using the tRowGenerator component. In the Palette view, begin typing the name of the component, in this case, tRowGenerator, into the Find Component box and then click the Search button. You can also press the Enter key. You can also click anywhere in the Designer and start typing the name of the component to add it. Once, the component is added to the designer, double click the component to open it. Here we will define the structure of data to be generated. ·Add as many columns to your schema as needed, using the plus (+) button. Type in the names of the columns to be created in the Columns area and select the Key check box if you wish to define a Key for the generated data Make sure you define then the nature of the data contained in the column, by selecting the Type in the list. According to the type you select, the list of Functions offered will differ. This information is therefore compulsory. Once done, Click on OK to close the dialog box. Now to write the generated data by the tRowGenerator to a flat file on your local system, we will use the tFileOutputDelimited component. Just follow the same steps that we used to add the tRowGenerator component. Once added, your Job should look like this. Now we will connect the tRowGenerator component to the tFileOuputDelimited component. To do that, right click the tRowGenerator_1 component, then select Row > Main. Move the mouse over top of tFileOuputDelimited_1 component and click on it. Once done, your Job should look like this. Now let’s configure the tFileOutputDelimted component. To do that, double click the tFileOutputDelimted_1 component and you would notice that the Component View has opened at the bottom of the screen. Provide the absolute path of the file where do you want to output the generated data. In this example, it is "C:/Users/PINGAHLA/Desktop/Talend_Demo/demo.txt" Now let’s add the tS3Connection & tS3Put component to the Job. Now let’s connect the initial Subjob that we are using to create a file and connect it to the tS3Connection. Right-click on the tRowGenerator_1 component, then select Trigger > On Subjob Ok. Move the mouse over top of tS3Connection_1 component and click on it. Similarly, Right-click on the tS3Connection_1 component, then select Trigger > On Subjob Ok. Move the mouse over top of tS3Put_1 component and click on it. Your final Job should look like below. Now let’s configure the tS3Connection & tS3Put component. Double-click tS3Connection_1 to open its Basic settings view on the Component tab. In the Access Key and Secret Key fields, enter the authentication credentials required to access Amazon S3. Ensure that the values are enclosed in double-quotes. The tS3Connection component should have all the details as above. Double-click the tS3Put component to open its Basic settings view on the Component tab. Select the Use an existing connection check box to reuse the Amazon S3 connection information you have defined in the tS3Connection component. In the Bucket field, enter the name of the S3 bucket where the object needs to be uploaded. In this example, it is talend-data and the bucket is already present in Amazon S3. Note: Ensure that the bucket is already present in your Amazon S3 Instance. In the Key field, enter the key for the object to be uploaded. In this example, it is demo. Note: This field ensures the filename for the file that is being uploaded to the S3 bucket. In the File field, browse to or enter the path to the object to be uploaded. In this example, it is "C:/Users/PINGAHLA/Desktop/Talend_Demo/demo.txt" Press Ctrl + S to save the Job. Press F6 to run the Job. Run details on a successful run should look like below: Login into your S3 bucket and you will find that the file has been uploaded to your S3 Bucket. In case you want to upload the file to a particular folder in your S3 bucket, enter the entire path along with the filename in the Key field. In this example, it is “demo/demo.txt”. Press Ctrl + S to save the Job. Press F6 to run the Job. Login into your S3 bucket and you will find that the file has been uploaded to your S3 Bucket. Check out our instructional video on how to create an ETL data pipeline using Talend Studio and its #Amazon #S3 bucket connector
Self-Service with Talend Solutions (Stitch, Pipeline Designer, and Data Preparation)
More and more of Pingaha's Talend customers are looking for ways for self-service around data extraction, transformation, and insights, allowing them to conduct their business in a globally competitive market. At Pingahla we are constantly educating our Talend customers on many of the current and new self-service Talend solutions. Within this blog posting, I will be discussing the high-level benefits of Talend's self-service solutions such as Stitch, Data Pipelines and Data Preparation. Let me first start with Stitch. In Nov 2018, Talend acquired Stitch to complement its unified platform, allowing its users to have the ability for self-service for data extraction for non-technical users. Stitch is a web-based cloud EL (Extract and Load) solution in which users can quickly extract data from a wide variety of supported sources and load that data into a specific supported target with Stitch. This can be down in as little as 7 major steps. 1. Simply go to www.stitchdata.com. 2. If you have an account either Sign In or Sign Up for a Stitch account. Note: Within these steps below, I have created my account. 2.1. Enter the required requested information to create your Stitch account. 2.2 Stitch will send an email to confirm your email address. Please click the confirmation email, as this will confirm your email address and account with Stitch. 3. Once you have either logged in or created your account, next you will want to select the data source in which you would like to extract the data. Stitch has around 108 data source connectors. Quickly find your source and select the connector. In this example, we have selected the "Microsoft SQL Server Integration" source connector. Note: If you cannot find your source connector, you have the ability to suggest a connector to the Stitch-Talend team. If you are a paid customer, most likely your suggestion will move hirer within the development queue for Stitch-Talend. 4. Next, you will need to enter in details of your data source and select the specific data tables so that Stitch can connect and extract the data. Note: You can also invite a member of your team to enter in the connection details if you do not have it. 5. Once you have entered your source details you will then need to select your target destination. Stitch has 8 target destination connectors in which you can choose from. In this example, we are selecting Amazon S3. 6. Once you have selected your target destination, you will then need to enter your target connection details. 7. Once you have entered in your target destination, you will set up your schedule and PRESTO, you have created your very first self-service data pipeline with Stitch. Another truly self-service ETL (Extract, Transform and Load) solution from Talend is Pipeline Designer. Talend's Pipeline Designer is a cloud-based ETL solution that recently was released to the public in April 2019. Not only does it allow for its users to build-out ETL data pipelines using drag and drop functionality, but it also allows users to incorporate and support Python coding! A huge yay for Data Scientist. Now there are a few restrictions around the Python competent which I recommend you review the Talend documentation. But here are a few steps to get you started with Talend's Pipeline Designer. 1. To get started with Pipeline Designer you will need to first have access to Talend Cloud. Note: Talend Cloud has made recent announcements with its latest update with Talend Cloud on Azure, so you will need to make sure you select the correct cloud provider and zone before logging in. In this example, I am logging into the Talend Cloud AWS, East Zone. 2. Once you have logged into Talend Cloud, you will be directed to Talend's Cloud portal which will be the "Welcome" page. Here you will be allowed to Launch the different Talend Cloud applications or watch specific tutorials on the different Talend Cloud applications. For the purpose of this blog posting, you will need to launch Pipeline Designer. There are two ways to launch Pipeline Designer, which include finding the application on the Welcome page and click on the Launch button, or within the left-hand corner select the "Select an App" drop-down and select Pipeline Designer. 3. Once you Launched Pipeliner Designer you will either be directed to Pipeliner Designer or a Pop Up that will request you to set up your Talend Pipeline Designer remote or cloud engine. In this blog posting, I will not be going through the steps in setting up a Talend remote/cloud engine for Pipeline designer as I will revisit this in a future post. 4. After launching Pipeline Designer you will be directed to the "Datasets" page within Pipeline Designer. You can create your data pipeline from the "Datasets" page or the "Pipelines" page. In my blog posting, I will be creating my data pipeline from the "Datasets" page by right-clicking the Customer data source and Add Pipeline. 5. Once you add your Dataset to a Pipeline or start creating a data pipeline from scratch, you will be directed to the Talend Pipeline Designer Canvas. Here you will be able to create your ETL or ELT data pipeline process. In this example, you will see we already have our Customer data source as the first component in our data pipeline. 6. Once you have established a pipeline and data source, you can then add an additional Talend processor competent for your data pipeline. To add a Pipeline processor, you will want to click the "+" button within the data pipeline flow. The data processor will allow you to transform your data. Talend offers 9 major processor components with Talend Pipeline Designer. For the full list please visit; Talend Cloud Pipeline Designer Processors Guide. 7. After adding the different Talend processor competent to your data pipeline, you will then need to add the Target destination. To add the Target destination click on the "Add Destination" button within the data pipeline flow. Note: You have the ability to preview the data as it flows through the different Talend processors by clicking on the "Preview" button within the data pipeline flow. 8. Now that you have completed building out your ETL/ELT Talend Data Pipeline Designer data pipeline, you will want to run your pipeline. To simply run your data pipeline, click on the "Play" button on the top middle right-hand corner. 9. If the data pipeline you have completed is correct. You will receive a green pop up notifying you your data pipeline has completed and loaded successfully or unsuccessfully. One of my favorite Talend self-service solutions is Data Preparation. Not only do I believe this is a favorite self-service solution from Talend, but many of my customers would say the same. Talend Data Preparation aka, Data Prep is a on-prem or cloud solution that is truly simple and easy to use solution that enables users to quickly identify data anomalies as well as speed up the delivery process of cleansed or enriched data by allowing non-technical users to collaborate with IT on collaborative pipeline solutions, or enrich data sets for BI reporting needs without involving IT. Some of the major benefits of Talend Data Preparation that make it my favorite solution is the following; * Collaboration in development with non-technical users * Replacement of STTMs (Source To Target Mappings) * Self-service to profile, transform, and enrich data sets These are the three major reasons why I enjoy using Talend Data Preparation. The simple and easy to use solution looks like Microsoft Excel. So off the bat, there is nothing technical or scary about the solution. But let me break it down. 1. Talend Data Preparation is an on-prem and Cloud-based solution. Here I will focus on the cloud version. The on-prem version works very similarly, except when connecting to data sources and how to collaborate with a developer's Talend Studio data pipeline. But I will get into that in a later post. 2. Similar to steps 1-3 outlined in the Talend Pipeline Designer steps, you will need to make sure you are logged into Talend Data Preparation. Or if you are using the on-prem solution, you will need to have the on-prem solution installed with a valid license. 3. Once you are logged in, you will need to make sure you have the data sources you would want to profile, transform or enrich. Talend allows users to add different types of datasets that include local flat files (txt, Excel, csv, and etc), talend jobs, databases, Amazon S3, and Salesforce. 4. After you have established and imported your datasets, you can now create a "Preparation" in Talend's simple and easy to use grid-like graphical interface. 5. To create a "Preparation," simply click on the "Add Preparation" button. The "Add Preparation" button will allow you to profile, enrich and transform your source data. 6. Now that you have imported and set up your connections to your Datasets, and have created your first preparation you will now notice your add has been uploaded to Talend's grid-like interface. Now what you may or may not notice off the bat is that Talend Data Preparation has quickly profiled the data for you highlighting its analysis. What do I mean by this? Now that your data has been created as a preparation, take a look at the following data highlights. 7. Column Discovery: Talend has analyzed the metadata doing its best to determine the column metadata. You have the option to update this manually if the analysis is incorrect. 8. Quality Bar: Talend Data Preparation, highlights columns in different colors (Green, Black, and Orange). The different colors are used to identify valid records, empty records, and invalid records. You can also create your own rules on the analysis. 9. Talend Data Preparation, allows users to enrich and transform the data set without enriching or transforming the data directly on the source. But within Talend Data Preparation you are able to leverage "FUNCTIONS" that allow you to add enrichment or transformation rules to the data set. 10. Data Profiling: As mentioned prior, another huge benefit of Talend Data Preparation is the data profiling aspect. Users are able to showcase and quickly profile the data columns to have a better understanding of the data. 11. Data Lookup: With an self-service solution you need to have the functionality to do a lookup for either validation or enrichment, and Talend Data Preparation allows you to easily do this with the lookup feature. 12. Filters: In addition, to adding lookups you can also add filters to the data set. This can quickly be done by selecting the Filters option on the top middle-right hand corner of Talend Data Preparation. 13. Recipe: Now here comes the best feature, "Recipe." Now as you enriched and transformed your data while doing this Talend Data Preparation was documenting each transformation step allowing you to no longer needing to create lengthy STTM's as well as see how the data will change when you apply a function. In addition, this "Recipe" can be shared with a developer's Talend ELT or ETL job in Talend Studio, allowing a user to collaborate with IT. 14. Exporting: The last beneficial feature is the exporting feature. Talend Data Preparation allows users to export the data in multiple formats that include flat files, Tableau and Amazon S3. Hopefully, you enjoyed this blog posting. Please leave a comment for any additional feedback or questions. Also, if you are interested in learning more about Pingahla's Talend services, please contact sales@pingahla.com.
Talend Cloud on Microsoft Azure - What to Know!
If you are a Talend retail customer, you may have been holding off moving to Talend Cloud because it was hosted on AWS, or you may be an Azure customer looking for the native Azure solution. Well, now you have an option in using Talend Cloud on Microsoft Azure. Talend made this huge announcement in Oct 2019, and many on-prem retail customers want to flip the cloud switch. Hopefully, you had the opportunity to check out Talend's Webinar on Oct 30, 2019, that introduced the Talend Cloud Azure offering, but if you missed it you are in luck as Talend is having another live Webinar on Nov 13, 2019, 11 am SGT. But before making that jump to Talend Cloud on Azure, I put together this blog post to help educate you on Talend's Azure offering as it is a little different than the Talend Cloud AWS offering due to the Talend Azure Roadmap, which I will address within this blog posting. Now with Talend Cloud rolled out on Microsoft Azure, you will have to note for this first phase for Talend you will only be able to leverage Azure on the United States - West Zone. Now, why do I bring this up? If you are a current Talend Cloud AWS user, you have the ability to select different zones based on your region. Now, as per the latest webinar, Talend will expand the zones in the future. So keep this in mind. One of the other differences with Talend Cloud on AWS and Talend on Azure is that Talend's Data Preparation and Data Stewardship solution won't be available on Talend on Azure until the end of the year (2019). Luckily for you all, we only have another month until the end of Dec 2019, in which these solutions will be readily available for the Talend Cloud Azure users. Besides Talend Data Preparation and Data Stewardship not being available, Talend's API Designer and API Tester won't be available day 1. Though it was not discussed on the prior webinar, or I may have missed it, no date was given when Talend's API Designer and API Tester would be available on Talend Cloud on Azure. If anyone knows when this would be available, please let me know, else I will try to find out on the next live webinar. But though some of these applications may not be available, what is available is that Talend Cloud Azure users will be able to harness the cloud power of Microsoft Azure, in which Talend Cloud Azure runs natively. In addition, to access your TIC, you will have the ability to leverage Talend Studio and Talend's new self-service ETL solution, Pipeline Designer. This is a huge step for Talend and its Talend users and I ask you all, to check out and visit www.talend.com/talend-azure-trial to get access to your free 30 day trial to Talend Cloud on Azure. If you are looking for support and assistance with your Talend Cloud Azure 30 day trial, don't hesitate to reach out and if you are interested in learning more about Pingahla's Talend services, please contact sales@pingahla.com.
Enabling and Installing Talend Custom Components and Connectors within Talend Studio
If you are currently using Talend Open Studio (TOS), or Talend’s Enterprise Studio, you have already been able to leverage the many benefits of the out of the box Talend components and connectors. These components and connectors help accelerate the development of your ETL/ELT data pipelines. Currently, the Talend Enterprise Studio version has over 1000+ components and connectors and where TOS has about roughly 850 components and connectors in which can be used. For a full list of all the available components and connectors, click on the Talend components and connectors URL below. https://www.talendforge.org/components/index.php As Pingahla has worked with many customers using both TOS or Talend’s Enterprise Studio, we sometimes run into an issue where a connector or component is not available. Now the great benefits for Talend are if there are no connectors or components available, there are many ways around this, such as connecting by API, using ODBC, or JDBC, etc. But before going down this rabbit hole, I ask you to check out Exchange.Talend.com as you may be able to find an available connector created by the Talend Community. As some of you may or may not know, Talend was once and still is an open-source solution. Talend has a HUGE community that helps build new components or connectors. In this blog posting, I will provide some instructions on how you can download and use any of the available components or connectors from the Talend’s Exchange place or other Talend forums. Step 1: First you will need to have TOS or Talend Enterprise Studio version installed. In this blog posting, we will be sharing screenshots in using Talend's Data Fabric Studio version 7.1.1. The following steps would work with TOS or any of the other Talend Enterprise Studio versions, including the latest 7.2.1. Step 2: Open your TOS or Talend Enterprise Studio on your machine. Step 3: As you open Talend Studio, you will be asked to select a connection, an existing project, etc. In this example within this instructional blog, we will select the default Local connection and Local Project. It does not matter which you select, as long as you open Talend Studio. Step 4: Your Talend Studio should have successfully opened. You may have a pop up that appears to download some latest updates or sign up/sign into Talend Community. You can close out those windows. Step 5: Navigate to the top-left toolbar within Talend Studio and click on the "Window" option. From there you will need to select the "Preferences" option. Steps 6: Once completing step 5, your Talend Studio Preferences window will pop up. Navigate to the "Talend" option. Step 7: Next you will need to navigate outside of Talend Studio, and create a new folder where you will store and save the custom Talend components and connectors. In this example, we are creating a folder in the following folder and Subfolder (Talend\tlnd_custom) within the C drive on a Windows machine. Step 8: Once you have created a folder to store and save your custom Talend components and connectors, you will need to navigate back to Talend Studio and update the "Temporary files" path with the new folder directory in the Preferences/Talend/Temporary files section. Once completed click, "OK." Steps 1-8 will allow you to start enabling and use custom Talend components and connectors that you are able to obtain Talend’s Exchange place or other Talend forums. The following steps below will walk you through in setting up the custom Sharepoint connector (tSharepointFile) created by knowledgerelay.com. Once you have completed Steps 1-8, you will need to download a custom Talend component and connector. Step 9: In this step, we will download the FREE tSharepointFile component from knowledgerelay.com from Talend’s Exchange place. Please note, you will need to create a FREE user account on Talend's Exchange place if you do not have one already. Make sure you save your custom Talend component or connector file in the new folder you created on Step 8. Step 10: Once you have downloaded your custom Talend component or connector you will need to unzip the zipped file. In this example, we have downloaded the zipped tSharepointFile folder in our C:\Talend\tlnd_custom folder and we will use 7-Zip to unzip and place all the zipped files into a new unzipped folder named, "tSharepointFile." Step 11: Once you have successfully unzipped the custom Talend component or connector folder, you can delete the zipped download file. In this example, we are deleting the tSharepointfile.zip. Step 12: Restart your Talend Studio, your new Talend component or connector should appear within the Talend Palette. Note: If you do not see your new Talend component or connector, this could be the XML file of the custom Talend component or connector is not well configured. Please follow the optional steps below, or if you want to modify where you custom Talend component or connector resides in the Talend Palette, you can also use the optional steps below. Step 13: Locate the custom Talend component or connector XML document. This should be the only XML document file within the custom Talend component or connector folder. In this example, we will be reviewing the tSharepoinFile_java.xml file. Step 14: Open the XML custom Talend component or connector file. It is recommended to use Notepad++ or Notepad. In this example, we will open the tSharepoinFile_java.xml file with Notepad. Step 15: Once you have opened the custom Talend component or connector XML file, search for the following XML tag, "." If you cannot find the following tag, this could be the reason why the custom Talend component or connector is not appearing and showing up as an option in Talend Studio. Step 16: Under the and above the tag add and SAVE the following statement; File/Management The following statement will add the tSharepointFile as a component to Talend Studio Palette. It also stored the tSharepoinFile under the File/Management sub-sections. Note: You can also modify and change existing custom Talend component or connector locations by updating the FAMILY path. Step 17: Restart your Talend Studio, your new Talend component or connector should appear within the Talend Palette. Hopefully, you enjoyed the technical and instructional blog posting. Please leave a comment for any additional feedback or questions. Also, if you are interested in learning more about Pingahla's Talend services, please contact sales@pingahla.com.
How is your data?
In #2017, Harvard Business Review wrote an article about how only 3% of Companies’ Data Meets Basic Quality Standards. Hopefully, in #2019 you are one of these companies. If not, reach out to us so that we can discuss how we are leveraging and implementing our partner solutions at Informatica, Talend, AWS , Google Cloud and Pingahla’s accelerators making our customers one of the 3%, with quality data within their organization!

34 results found with an empty search

Get in Touch with Pingahla

New York City, NY (Headquarters)

Pune, India

Bogotá, Colombia