YDT Blog

In the YDT blog you'll find the latest news about the community, tutorials, helpful resources and much more! React to the news with the emotion stickers and have fun!

Advice needed to create an economic News dataset for niche categories that will be used for classification.

So basically we're trying to create a dataset for classifying news articles into classes such as cybercrime, corruption, acquisition, and merger, all within economic news. We're using BERT to train our classifier and need 2k-3k articles per class for an MVP.

Right now we're gathering articles by using APIs like webhose.io and pushshift.io to fetch a batch of articles for a keyword such as "cybercrime", then hand-labelling them to confirm they're actually about cybercrime and to remove false positives and duplicates. It's tiring, since we need to go through roughly 20k articles to get 2k for each class.

We're looking for services or anything else that would make this process easier. We're students, so we'd prefer a free service, but we're willing to spend money if we get very desperate. Any advice is welcome!
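One cheap way to shrink the hand-labelling pile is to deduplicate and rule-filter the fetched batch before any human looks at it. Below is a minimal sketch (the `articles` structure, function names, and keyword lists are mine, not from the post): exact/near-exact duplicates are dropped by hashing normalized text, and a keyword pre-filter removes obvious off-topic articles.

```python
import hashlib
import re

def normalize(text):
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedupe(articles):
    """Drop exact and near-exact duplicates by hashing normalized text."""
    seen, unique = set(), []
    for art in articles:
        key = hashlib.md5(normalize(art["text"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(art)
    return unique

def keyword_prefilter(articles, must_have, must_not):
    """Cheap rule-based pass to shrink the pile before hand-labelling."""
    kept = []
    for art in articles:
        text = normalize(art["text"])
        if any(k in text for k in must_have) and not any(k in text for k in must_not):
            kept.append(art)
    return kept
```

This won't replace hand-labelling, but it can cut the number of articles a human has to read; a fuzzier dedup (e.g. MinHash or shingling) would catch reworded syndicated copies too.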


submitted by /u/roonishpower

Source: Reddit Data Science

[P]Neural Principal Component Analysis

Dear All:

PCA is the old linear workhorse, while t-SNE and UMAP are newer non-linear methods for dimensionality reduction. The non-linear ones are beautiful but lose the linear interpretation. My npca combines the linear and non-linear ideas. The idea is to build a self-supervised MLP as follows:

data => 2 PC => 32 => 32 => data

That is, a linear encoder followed by a two-layer non-linear decoder. After training npca, we only care about the linear part, the PCs and loadings, and discard the decoder (replacing it with the human eye). The result is still a simple rotation of the data, but it explains more variance. It is a drop-in replacement for PCA.
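The data => 2 PC => 32 => 32 => data architecture can be sketched as a forward pass in plain NumPy. This is my own illustration of the shapes involved, not the author's code, and the training loop (fitting all weights to minimize reconstruction error) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10                       # samples, original dimension

# Linear encoder: project data onto 2 components (the part we keep, like PCA loadings).
W_enc = rng.normal(size=(d, 2))

# Non-linear decoder: 2 -> 32 -> 32 -> d MLP (discarded after training).
W1, W2, W3 = (rng.normal(size=s) * 0.1 for s in [(2, 32), (32, 32), (32, d)])

def forward(X):
    pc = X @ W_enc                   # the linear "principal components"
    h = np.tanh(pc @ W1)             # decoder hidden layer 1
    h = np.tanh(h @ W2)              # decoder hidden layer 2
    return pc, h @ W3                # reconstruction of X

X = rng.normal(size=(n, d))
pc, X_hat = forward(X)
```

After training, only `W_enc` (and the resulting `pc`) matters; because the encoder is linear, the components remain interpretable as a rotation of the original features, which is exactly what t-SNE/UMAP embeddings lack.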

Hope you enjoy it!


submitted by /u/wangyi_fudan

Source: Reddit Machine Learning

Need suggestion: First-day work as Data Analyst under the Risk Management Department

I am a fresh Chemical Engineering graduate pivoting to the data field. I got accepted as a Data Analyst in the Risk Management Department of a financial-technology company focused on peer-to-peer lending, and I start work on Monday, Jan 27.

I am really nervous because this is my first experience working with real-world data. Can I have some suggestions or advice? What do I really need to prepare? What will I actually do at work? Suggestions not related to data work are fine too.

Any comments are appreciated. Thank you in advance.

submitted by /u/ebuzz168

Source: Reddit Data Science

[D] Warning: Colab has started silently disallowing access to TPUs and GPUs if you use it often

I've been using Colab for a while and love it. However, after playing a little with JAX today and trying to restart a notebook, I found myself unable to start a runtime with either a TPU or a GPU. The error looks like this.

It seems like this is only an issue with my account, and it seems likely that they are silently limiting the accounts of people who use Colab too much. I am not sure whether this was triggered by recent usage (I haven't actually used it as much recently – I was using it more until about a week ago) or whether someone at Google saw my viral post about finetuning gpt2-1.5b on Colab for a long time, which is my guess.

At any rate, watch your usage if you use Colab a lot, and possibly look at alternatives for when the same happens to you. A good option in some cases is Kaggle, however, I am specifically starting a TPU project so this isn't an option for me.

submitted by /u/Tenoke

Source: Reddit Machine Learning

[D] Training GANs & Nash equilibrium

I have a theoretical question on the meaning of training a GAN: is the goal of the training process to find a Nash equilibrium, or to oscillate around one? If it's the latter, why do we keep oscillating around a stable state? Shouldn't we just end up in the stable state? Any reference is very welcome!
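The oscillation can be seen in a toy model much simpler than a GAN: the bilinear game f(x, y) = x*y, whose unique Nash equilibrium is (0, 0). Under simultaneous gradient descent (min player) and ascent (max player), each step multiplies the distance to the equilibrium by sqrt(1 + eta^2), so the iterates rotate around the equilibrium and actually spiral outward rather than converging. A minimal sketch (my own illustration, not a GAN training loop):

```python
import numpy as np

eta = 0.1
x, y = 1.0, 1.0                           # start away from the equilibrium (0, 0)
norms = []
for _ in range(200):
    gx, gy = y, x                         # df/dx = y, df/dy = x for f(x, y) = x*y
    x, y = x - eta * gx, y + eta * gy     # simultaneous update: min descends, max ascends
    norms.append(np.hypot(x, y))          # distance to the Nash equilibrium
```

This is one standard intuition for why plain simultaneous gradient updates oscillate or diverge near an equilibrium, and why stabilization tricks (alternating updates, extragradient, averaging, gradient penalties) are used in practice.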

submitted by /u/albdemens

Source: Reddit Machine Learning

[R] Looking for paper on history of ML model "failures"

A while ago I stumbled upon a paper (vague memory) that discussed a few cases where ML models did not live up to their hype, e.g. the Google Flu model, which despite being sophisticated merely predicted cold seasons. However, after a lot of googling I just can't find it anymore.
Maybe someone knows what I am looking for or knows about related papers? Would be greatly appreciated!

submitted by /u/hegelsmind

Source: Reddit Machine Learning

[D] can an outlier elimination process cause data leakage?

I'm currently working with a dataset with 90,000 samples and ~40 features. I was meticulous at keeping the test set strictly away from the training set to prevent any data leakage, except in one place: outlier pruning with z-score.

I know it depends on the data, but is there any possibility that the model is learning something from this process? My intuition says yes, but I would like to hear others' intuitions as well.
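For what it's worth, the usual leakage-free recipe is to fit the z-score statistics on the training split only and leave the test set untouched; if mean and std are computed on the full dataset before splitting, the filter has already seen the test distribution. A minimal sketch of the safe version (function name and threshold are my own):

```python
import numpy as np

def zscore_prune(X_train, X_test, thresh=3.0):
    """Remove training outliers using statistics fit on the training set only.

    The test set is returned untouched: pruning test rows, or using pooled
    mean/std, would leak test-set information into the training procedure.
    """
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    keep = (np.abs((X_train - mu) / sigma) < thresh).all(axis=1)
    return X_train[keep], X_test

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(1000, 5))
X_tr[0] = 50.0                       # plant an obvious outlier in the training set
X_te = rng.normal(size=(200, 5))
X_tr_clean, X_te_out = zscore_prune(X_tr, X_te)
```

If the original pruning used mean/std pooled over train and test, the leakage is usually mild (summary statistics only), but the cleanest check is to rerun with train-only statistics and compare test performance.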

Thank you in advance.

submitted by /u/cymetric10

Source: Reddit Machine Learning
