YDT Blog

In the YDT blog you'll find the latest news about the community, tutorials, helpful resources and much more! React to the news with the emotion stickers and have fun!

[D] 2nd Order Approximation in XGBoost’s Objective Function

Hi all,

I have a quick question regarding XGBoost’s objective function. I was reading the XGBoost paper (https://arxiv.org/pdf/1603.02754.pdf) and saw that the authors approximate the original objective function using a 2nd order Taylor series (page 2, section 2.2). Is there a particular reason why it’s expanded to 2nd order and not higher? I’m guessing that a linear approximation is not enough and higher orders require more computational power, but is there a mathematical justification, or is this a design choice?
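
For reference, the approximation in question (section 2.2 of the paper), with \hat{y}_i^{(t-1)} the prediction after t-1 trees and f_t the tree being added:

\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(\mathbf{x}_i) + \tfrac{1}{2} h_i f_t^2(\mathbf{x}_i) \right] + \Omega(f_t),
\qquad g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}), \quad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}).

Keeping the second-order term h_i is what gives each leaf a closed-form optimal weight, w_j^* = -\sum_{i \in I_j} g_i / (\sum_{i \in I_j} h_i + \lambda), which a purely first-order expansion would not provide.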

submitted by /u/_kty
Source: Reddit Machine Learning

ELK with Talend Cloud

Overview

ELK is the acronym for three open source projects where E stands for Elasticsearch, L stands for Logstash and K stands for Kibana. ELK is a robust solution for log management and data analysis. These open source projects have specific roles in ELK as follows:

  • Elasticsearch handles storage and provides a RESTful search and analytics endpoint (see the quick query sketch after this list).
  • Logstash is a server-side data processing pipeline that ingests, transforms and loads data.
  • Kibana lets you visualize your Elasticsearch data and navigate the Elastic Stack.
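
As a quick illustration of that RESTful endpoint, here is a minimal query sketch in Python; the domain endpoint and index name are hypothetical placeholders:

import requests

# Hypothetical Amazon ES domain endpoint and index name; substitute your own
host = "https://my-es-domain.eu-west-1.es.amazonaws.com"

# Query-string search against the index's _search endpoint
resp = requests.get(host + "/talend-logs/_search", params={"q": "message:error"})
print(resp.json())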

In this blog, I am going to show you how to configure ELK while working with Talend Cloud. The focus will be on loading streaming Talend Cloud logs into Amazon ES from Amazon S3. Refer to this help document from AWS for more details.

Process Flow

Talend Cloud enables you to save execution logs automatically to an Amazon S3 bucket. The flow for routing Talend Cloud logs into ELK is shown below.

[Image: Talend Cloud ELK flow]

Once you have configured Talend Cloud logs to be saved to the Amazon S3 bucket, a Lambda function is used to send the data from S3 to the Amazon ES domain. As soon as a log arrives in S3, the bucket fires an event notification to Lambda, which then runs custom code to perform the indexing. The custom code in this blog is written in Python.

Prerequisite

To configure ELK with Talend Cloud logs, you need:

  • A Talend Cloud account with log configuration in TMC – refer to this help document on Talend Cloud logs configuration
  • An Amazon S3 bucket – refer to this Amazon page on Amazon S3
  • An AWS Lambda function – refer to this Amazon page on AWS Lambda functions
  • An Amazon Elasticsearch domain – refer to this Amazon page on Amazon Elasticsearch domains

Steps

This section outlines the steps needed for loading streaming Talend Cloud logs to the Amazon ES domain.

Step 1: Configure Talend Cloud

  • Download the CloudFormation template. Open your AWS account in a new tab and start the Create Stack wizard on the AWS CloudFormation Console.

In the Select Template step, select Upload a template to Amazon S3 and pick the template provided by Talend Cloud.


In the Specify Details section, define the External ID, S3BucketName, and S3 prefix.


Click Create. The stack is created. If you select the stack, you can find the RoleARN key value in the Outputs tab.


In the Review step, select ‘I acknowledge that AWS CloudFormation might create IAM resources’.

Go back to the Talend Cloud Management Console and enter the details.

Step 2: Create an Amazon Elasticsearch Domain

Refer to this document : https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-gsg-create-domain.html

For this blog, I am selecting Development and Testing.

Give the domain a name.

Set the rest of the options as needed by your organization and click Create.

Step 3: Create the Lambda Function

    • There are multiple ways to create a Lambda function. For this blog, I am using an Amazon Linux machine with the AWS CLI configured.
    • Log in to the EC2 instance using PuTTY.

    Install pip, zip, and virtualenv using these commands:

    yum -y install python-pip zip

    pip install virtualenv


    • Run the next set of commands

    # Prepare the log ingestor virtual environment

    mkdir -p /var/s3-to-es && cd /var/s3-to-es

    virtualenv /var/s3-to-es

    cd /var/s3-to-es && source bin/activate

    pip install requests_aws4auth -t .

    pip freeze > requirements.txt


    Validate that the required files are installed.


    Create a file s3-to-es.py and paste the attached code into the file.
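
    The attachment isn’t reproduced here, so below is a minimal sketch of what such a log-ingestion handler typically looks like, following the AWS pattern this blog is based on. The region, domain endpoint, and index name are hypothetical placeholders; adapt them to your own domain.

    import boto3
    import requests
    from requests_aws4auth import AWS4Auth

    # Hypothetical values: replace with your own region, domain endpoint and index
    region = 'eu-west-1'
    host = 'https://my-es-domain.eu-west-1.es.amazonaws.com'
    url = host + '/talend-logs/_doc'

    # Sign requests to the Amazon ES domain with the Lambda role's credentials
    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                       region, 'es', session_token=credentials.token)

    s3 = boto3.client('s3')
    headers = {'Content-Type': 'application/json'}

    def handler(event, context):
        # One record per S3 event notification; each points at a newly arrived log file
        for record in event['Records']:
            bucket = record['s3']['bucket']['name']
            key = record['s3']['object']['key']
            body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
            # Index each non-empty log line as its own document
            for line in body.splitlines():
                if line.strip():
                    r = requests.post(url, auth=awsauth,
                                      json={'message': line}, headers=headers)
                    r.raise_for_status()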


    Change the file permissions to 754.
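
    For example, assuming the file name above:

    chmod 754 s3-to-es.py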


    Run the command to package the Lambda:

    # Package the lambda runtime

    zip -r /var/s3-to-es.zip *


    Send the package to the S3 bucket:

    aws s3 cp /var/s3-to-es.zip s3://rsree-tcloud-eu-logs/log-ingester/


    Validate the upload in the S3 bucket


    Create the Lambda function.
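
    The blog creates the function through the console; if you prefer the CLI, a function could be created from the uploaded package along these lines (a sketch: the role ARN is a hypothetical placeholder, and the runtime should match the Python version you packaged):

    aws lambda create-function \
      --function-name s3-to-es \
      --runtime python3.8 \
      --role arn:aws:iam::123456789012:role/lambda-s3-to-es \
      --handler s3-to-es.handler \
      --code S3Bucket=rsree-tcloud-eu-logs,S3Key=log-ingester/s3-to-es.zip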


    In the Function code section, select ‘Upload a file from Amazon S3’ and click Save.


    Add a trigger by selecting the S3 bucket.


    Validate that the trigger is added to the S3 bucket.


    Now let’s execute a Talend job so that the log is routed to S3. You can see from the Lambda Monitoring tab that the log is being pulled in. You can also view the logs in CloudWatch.
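
    You can also tail the function’s logs from the CLI; this sketch assumes the function name used above:

    aws logs filter-log-events --log-group-name /aws/lambda/s3-to-es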


    Step 4: Create a Visualization in Kibana

    Navigate to the Elasticsearch domain and notice that a new index has been created.
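
    A quick way to confirm this outside the console (a sketch; the endpoint is a hypothetical placeholder as before) is the _cat/indices endpoint:

    import requests

    host = "https://my-es-domain.eu-west-1.es.amazonaws.com"
    print(requests.get(host + "/_cat/indices?v").text)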


    You can also search for this index in the Kibana dashboard.


    Click on Discover to view the sample data.

    You can now create visualizations and view them in the dashboard.

    Conclusion

    In this blog we saw how to leverage the power of ELK with Talend Cloud. Once you have ELK configured, you can use it to diagnose and resolve bugs and production issues, or to gather metrics about the health and usage of jobs and resources. Well, that’s all for now; keep watching this space for more blogs, and until then, happy reading!

The post ELK with Talend cloud appeared first on Talend Real-Time Open Source Data Integration Software.

Source: Talend

Importance of setting realistic expectations

If product owners have realistic expectations, your models will have more impact, sooner

It seems like every company is now building Data Science teams and investing in Machine Learning platforms, either third party or built in-house.

However, research has shown that few companies are deploying Machine Learning models to production [1]. While technical complexity plays a part, unrealistic expectations are also to blame.

Setting reasonable expectations

Developing Machine Learning models requires a big investment. You need to hire Data Scientists, invest in Big Data tools and build / buy model serving platforms. As a result, the expectations are often high, sometimes too high.

In order to ensure the success of an AI project, Data Scientists need to make sure they set realistic expectations before they even start building a model. If the expectations are unattainable, you will get approval to develop models but they will never make it into production.

Setting expectations can be complicated but there are a few things you can do to help steer the conversation:

  • Use existing processes as a baseline
  • Talk about false positives and false negatives — what are you optimizing for?
  • Use a simple model to set expectations

Use existing processes as baselines

Machine Learning models usually replace or automate existing processes. Look into those processes and try to quantify their overall performance (accuracy, false positive / negative rates, time to prediction, etc.). This will give you a baseline of what your model will need to achieve.
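
As a minimal sketch, you could quantify the existing process from a sample of its past decisions; the decisions and outcomes below are hypothetical stand-ins for your historical records:

from sklearn.metrics import confusion_matrix

decisions = [0, 1, 0, 0, 1, 1, 0, 1]   # what the current process decided
outcomes  = [0, 1, 0, 1, 0, 1, 0, 1]   # what actually happened
tn, fp, fn, tp = confusion_matrix(outcomes, decisions).ravel()

print("accuracy:", (tp + tn) / (tp + tn + fp + fn))
print("false positive rate:", fp / (fp + tn))
print("false negative rate:", fn / (fn + tp))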

Unless your model matches the performance of existing processes, it will not be deployed, no matter the cost savings.

Don’t spend too much time looking into the overall cost of the solution; focus more on the performance of the overall model. The goal here is not to define how much the business could save by implementing a Machine Learning solution, it is to learn what level of performance your model is expected to reach.

Define acceptable false positive and false negative rates

When setting expectations, the first metric we think about is accuracy. It is easy to understand and is usually a good starting point when talking with product owners.

However, agreeing on a target accuracy is not enough. In nearly every application, the cost to the business of a false positive is different from the cost of a false negative. In many cases, the costs can be an order of magnitude apart!

Take the case of fraud prediction, for example. Let’s assume you sell a product for £100 with a 20% profit margin. This means that one sale will generate £20 of profit, but if someone buys the product using a stolen card you will be £80 out of pocket (production costs) plus fees charged by payment processors (which can be in the £20 range). In this case the cost of a false positive (a blocked legitimate sale) is £20, while the cost of a false negative (a fraudulent sale that goes through) is £100! A false negative is therefore 5 times more expensive, so your model can theoretically tolerate five times as many false positives as false negatives and still be profitable.
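
A toy sketch of that arithmetic, with a hypothetical 1% fraud prevalence thrown in to show how base rates feed into the expected cost per transaction:

COST_FP = 20.0    # blocked legitimate sale: £20 of lost profit
COST_FN = 100.0   # missed fraud: £80 production cost + ~£20 processor fees

def expected_cost(fp_rate, fn_rate, p_fraud=0.01):
    # False positives occur on legitimate transactions, false negatives on fraud
    return (1 - p_fraud) * fp_rate * COST_FP + p_fraud * fn_rate * COST_FN

# Five false positives cost the same as one false negative (5 * £20 == £100)
print(expected_cost(fp_rate=0.05, fn_rate=0.01))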

Use a simple model to set expectations

Deploying a Machine Learning model involves a lot of moving pieces, ranging from serving to monitoring. The first model you deploy will have to be relatively simple as the focus will be on the infrastructure and processes surrounding the model.

The first iteration of a model is about setting up processes, not about performance

Before spending too much time on anything else, start developing a very simple model with little to no feature engineering. Assume that this will be the first model to be deployed.
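
As a minimal sketch of what such a first model might look like, with synthetic data standing in for your real, un-engineered feature table:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic stand-in for your real features; no feature engineering applied
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Deliberately simple: a linear model with default settings
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))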

If the simple model does not come close to achieving the metrics agreed with the product owner during phases 1 and 2, there are two approaches you can take:

  • Build feature processing and complex models but run the risk of not being able to deploy the model due to deployment cost
  • Re-engage product owners to set more realistic expectations for a first version of the model

Choosing between these options will depend on your organisation and how strategic the model you are developing is. In any case, don’t proceed without talking to product owners first. It might not seem like much, but this could speed up the development cycle tenfold.

Conclusion

In order for Machine Learning models to be widely adopted in an organisation, Data Science teams need to engage the business before they build the first model.

Agreeing performance metrics at the beginning of a project and regularly checking in with product owners will help ensure that models are deployed swiftly.

References:


Importance of setting realistic expectations was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Towards Data Science
