The closest things we have to an AI

The science of extracting information from textual data has changed dramatically over the past decade. As the term Natural Language Processing replaced Text Mining as the name of this field, the methodology changed tremendously, too. One of the main drivers of this change was the emergence of language models as a basis for many applications aiming to distill valuable insights from raw text.

I recently completed the Natural Language Processing Specialization on Coursera created by the deeplearning.ai team. One of the things I was fascinated by is the evolution of language models in the past years. I…


A Data Scientist’s take on defending Machine Learning models

source: https://pasadenaweekly.com/permanent-record-is-accused-spy-edward-snowdens-defense-brief-to-the-american-people/

Introduction

I recently read Edward Snowden’s Permanent Record during my holiday. I think it is a great book that I highly recommend to basically anyone; however, it is particularly interesting for IT folks, for obvious reasons. It is a great story about a guy who grows up together with the internet, starts to serve his country in a patriotic fervour after 9/11, and becomes a whistleblower when he notices that the US has gone too far in violating privacy in the name of security. Moreover, the paradox I found most interesting is one a Data Scientist can easily relate to.

The systems that collect…


Support your argument with data!

source: https://assets.weforum.org/article/image/large_9M3Mr4G6HoxKpUnt1kMxQ2FOH5Z0dOy0aRWwumfWFmw.jpg

The subject of gender, particularly gender inequality, has generated a lot of debate recently. This post aims to provide helpful insights for anyone who’d like to study gender proportions in specific fields. I will provide some tips for data collection using web scraping, as well as an automated way of finding the probable gender of a person based on their first name.

Data collection

If you are lucky, your data already comes in a handy format, such as Excel or .csv. However, this is rarely the case. In most analyses, you have to collect your data yourself, generally from a website…


Natural Language Processing is a catchy phrase these days

This is Part 2 of a pair of tutorials on text pre-processing in Python. In the first part, I laid out the theoretical foundations. In this second part, I’ll demonstrate the steps described in Part 1 in Python on texts in different languages, and discuss how their effects differ depending on the structure of each language.

If you haven’t, you should first read Part 1!

You can check out the code on GitHub!

The author is a guest author working at Starschema Ltd. Check out their publication too!

Relevance

In the first part, I outlined text pre-processing principles based on a framework from…


Part II — Case Study

Natural Language Processing is a catchy phrase these days

This is Part 2 of a pair of tutorials on text pre-processing in Python. In the first part, I laid out the theoretical foundations. In this second part, I’ll demonstrate the steps described in Part 1 in Python on texts in different languages, and discuss how their effects differ depending on the structure of each language.

If you haven’t, you should first read Part 1!

You can check out the code on GitHub!

Relevance

In the first part, I outlined text pre-processing principles based on a framework from an academic article. The underlying goal of all these techniques was to reduce the dimensionality of text data while keeping the relevant information contained in the text. In this second part, I will present the effect of the following techniques on two central properties of text: word count and unique word count, the latter representing the dimensionality of text data:

  1. Removing stopwords
  2. Removing both extremely frequent and infrequent words
  3. Stemming…
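As a rough illustration of the first technique, the sketch below measures word count and unique word count before and after removing stopwords. It is not the code from the tutorial: the tiny `STOPWORDS` set, the regex tokenizer and the sample sentence are assumptions made up for this example, and a real analysis would use a proper stopword list for each language.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "of", "and", "in", "is", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def counts(tokens):
    """Return (word count, unique word count) for a list of tokens."""
    return len(tokens), len(set(tokens))


text = "The cat sat in the hat and the hat is a gift to the cat"
tokens = tokenize(text)

# Removing stopwords: keep only tokens that are not in the stopword set.
filtered = [t for t in tokens if t not in STOPWORDS]

print("before:", counts(tokens))    # (word count, unique word count)
print("after: ", counts(filtered))
```

On this toy sentence both numbers drop, which is the kind of effect the list above is about: fewer tokens overall and fewer distinct words, meaning lower dimensionality.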


Part I — Theoretical Background

This is Part 1 of a pair of tutorials on text pre-processing in Python. In this first part, I’ll lay out the theoretical foundations. In the second part, I’ll demonstrate the steps described below in Python on texts in different languages, and discuss how their effects differ depending on the structure of each language.

Introduction

Text mining is an important topic for both business and academia. Business applications include chatbots, automated news tagging and many others. In academia, political texts are often the subject of rigorous analysis. …


A vector map of Budapest. https://www.shutterstock.com/image-vector/black-white-vector-city-map-budapest-1035519106

Ever wondered how to draw a map of less common geographical areas? And color them based on some data? This pair of tutorials shows how to build this from scratch! First, you need to construct the border of your polygons — Part 1 is about this task. After that you need to create a map, and color those polygons according to some value of your interest. That will be shown in Part 2.

Part 1 of this tutorial is available here.

There are many tutorials on the internet for drawing maps in Python, even more sophisticated maps like heatmaps (where…


A map of Budapest. Source: https://hebstreits.com/product/budapest-hungary-downtown-vector-map/

Ever wondered how to draw a map of less common geographical areas? Perhaps even color them based on some data? This is the first in a series of two tutorials that show you how to build this from scratch! First, you need to construct the borders of your polygons: Part 1 is about this task. After that, you need to create a map and color those polygons according to some value of interest. That will be shown in Part 2.

There are many tutorials on the internet for drawing maps in Python, even more sophisticated maps like heatmaps…


Source: https://aperiodical.com/tag/statistics/

Have you ever wondered how combining weak predictors can yield a strong predictor? Ensemble Learning is the answer! This is the second of a pair of articles in which I will explore ensemble learning and bootstrapping, both the theoretical basics and real-life use cases. Let’s do this!

Part 1 of this blogpost about bootstrapping is available here!

You can look at the code on my GitHub:

Bagging: combining regression/classification trees

Now that we understand the principle underlying bootstrapping, let’s see how to make use of it in a machine learning context. Bagging stands for bootstrap aggregating. …


Source: https://aperiodical.com/tag/statistics/

Have you ever wondered how combining weak predictors can yield a strong predictor? Ensemble Learning is the answer! This is the first of a pair of articles in which I will explore ensemble learning and bootstrapping, both the theoretical basics and real-life use cases. Let’s do this!

You can look at the code on my GitHub:

Bootstrapping

Bootstrapping is a resampling method where observations are drawn from a sample with replacement. Let’s say you have 1,000 data points, and you create 100 distinct samples of 1,000 data points each by drawing from the original sample only, with replacement. …
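The resampling step described above can be sketched with the Python standard library alone. This is a minimal illustration, not the code from the linked repository: the helper name `bootstrap_samples`, the seeds and the synthetic Gaussian data are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_samples(data, n_samples, seed=None):
    """Draw n_samples resamples of len(data) points each, with replacement."""
    rng = random.Random(seed)
    n = len(data)
    # random.choices samples with replacement, so a point can appear repeatedly.
    return [rng.choices(data, k=n) for _ in range(n_samples)]


# Synthetic "original sample" of 1,000 data points (made up for this sketch).
data_rng = random.Random(0)
data = [data_rng.gauss(50, 10) for _ in range(1000)]

# 100 bootstrap samples of 1,000 points each, drawn only from the original data.
samples = bootstrap_samples(data, n_samples=100, seed=1)

# The spread of the resampled means estimates the uncertainty of the sample mean.
means = [statistics.mean(s) for s in samples]
print("mean of original sample:", statistics.mean(data))
print("std. dev. of bootstrap means:", statistics.stdev(means))
```

Because the draws are made with replacement, each resample typically contains some original points several times and leaves others out entirely, which is why the samples differ from each other even though they all come from the same 1,000 points.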

Mor Kapronczay

ML&NLP Engineer @ Bold360AI. Text mining and predictive statistics enthusiast.
