YDT Blog


Making big moves in Big Data with Hadoop, Hive, Parquet, Hue and Docker

Jump and run in this brief introduction to Big Data

What data at most big companies in 2020 looks like. Seriously.

The goal of this article is to introduce you to some key concepts in the buzzword realm of Big Data. After reading this article — potentially with some additional googling — you should be able to (more or less) understand how this whole Hadoop thing works.

So, to be more precise, in this article you will:

  • Learn a lot of definitions (yay)
  • Spin up a Hadoop cluster with a few bells and whistles via docker-compose.
  • Understand Parquet files and how to convert your csv datasets into Parquet files.
  • Run SQL (technically HiveQL but it is very similar) queries on your Parquet files like it’s nobody’s business with Hive.
  • Also, you are expected to have some basic knowledge of Docker & docker-compose, running Python scripts etc. — nothing crazy, but it is better if you know this in advance.

If you are already using an alternative to Hadoop’s ecosystem — cool — this article is more geared towards readers who have to get familiar with Hadoop due to work, university etc., and it presents just one “solution” to the Big Data conundrum, with its respective pros and cons.

Big Data: this is the big one. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time. Now you might ask the simple question (and a lot of people have): “How big is big data?” To be frank, that is a very hard question: depending on how fast technology moves, what one considers “big” data today might be “small” data tomorrow. Nonetheless, the definition above is pretty timeless, since it refers to sizes beyond the ability of commonly used tools — this is your reference line. So in 2020, let’s bite the bullet and say the following is true: when you start dealing with double-digit-TB datasets and above in your DBs, you are probably hitting the limits of some of the more run-of-the-mill tools, and it is maybe time to look into distributed computing, and potentially this article.

Hadoop: a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. The 3 most important core modules you should know about are:

  • (Storage) HDFS: a distributed file system with high-throughput access and redundancy (copies of all files are maintained across the cluster) at the same time
  • (Resource Management) YARN a.k.a. Yet Another Resource Negotiator: a framework for job scheduling and cluster resource management, e.g. tracking which nodes are available etc.
  • (Processing) MapReduce: a YARN-based system for parallel processing of large datasets. This is the main algorithm used to seamlessly distribute computational tasks across the nodes of the cluster. You can read up on the origins of MapReduce around the Web. Popular alternatives are Tez and Spark, which were developed later to process data even more efficiently.
Parts of the Hadoop Ecosystem in one diagram. Focus on HDFS, YARN, MapReduce and Hive for now.
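To make the MapReduce idea concrete, here is a toy, single-machine sketch of the map/shuffle/reduce phases in plain Python. This is only an illustration of the programming model, not the Hadoop API: real Hadoop distributes each phase across cluster nodes via YARN.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in a document.
    return [(word, 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is big", "data is everywhere"]
mapped = list(chain.from_iterable(map_phase(d) for d in documents))
result = reduce_phase(shuffle_phase(mapped))
print(result)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The classic word count: each document can be mapped independently (and therefore on a different node), and each key can be reduced independently as well, which is exactly what makes the model easy to distribute.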

Hive: a data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive. So, basically Hive sits on top of the aforementioned Hadoop stack and it allows you to directly use SQL on your cluster.

Hue: an open source SQL Assistant for Databases & Data Warehouses i.e. an easy GUI for looking into HDFS, Hive etc. Very handy for beginners! It is maintained by Cloudera and you can find it on GitHub.

Parquet: a columnar storage* format available to any project in the Hadoop ecosystem. You can read why this is a good idea with big data sets in the explanation below. Again, there are a lot of alternatives but this technology is free, open-source and widely used in production across the industry.

*columnar storage: in normal row-based DBs e.g. MySQL you are storing data in rows, which are then distributed across different blocks if you cannot fit your data on one block. Now if you have a lot of columns and rows in your data distributed across multiple blocks, things can get pretty slow. Which is why you could instead store each column in a separate block. In that case you can access all the data of a column by just accessing one block. There is a great longer explanation on the concept here. As AWS puts it (unrelated to Parquet but still true): “Columnar storage for database tables is an important factor in optimizing analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk.” Here is also another comparison to CSV, which shows how much storage you can save and what kind of speedups you can expect.
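A tiny pure-Python sketch of the difference between the two layouts (an analogy only; real databases store blocks on disk, not Python objects):

```python
# Row-oriented layout, as in e.g. MySQL: each record is stored whole.
# Reading one column still touches every row record (every "block").
rows = [
    {"title": "Main_Page", "views": 100, "lang": "en"},
    {"title": "Python",    "views": 50,  "lang": "en"},
]
views_from_rows = [r["views"] for r in rows]

# Column-oriented layout, as in Parquet: each column is stored
# contiguously, so reading one column means reading one "block".
columns = {
    "title": ["Main_Page", "Python"],
    "views": [100, 50],
    "lang":  ["en", "en"],
}
views_from_columns = columns["views"]

assert views_from_rows == views_from_columns == [100, 50]
```

Analytic queries typically aggregate a handful of columns over millions of rows, which is why skipping the untouched columns entirely cuts disk I/O so drastically.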

Just a great Jaws reference and this is here solely to brighten your mood if the article is already too much.

Building something

Ok. Time to build something! Namely a Hadoop cluster with Hive on top of it to run SQL queries on Parquet files stored in HDFS all the while visualizing everything in Hue. Does this sentence make more sense now than in the beginning of the article? Cool.

There are a lot of ways to do this, and Hadoop is famous for operating on computer clusters built from “commodity hardware”. But since this is just a learning exercise, it is a bit easier to quickly spin up a small Hadoop cluster with the aforementioned stack in Docker with docker-compose. You can of course do this in Kubernetes too, but that is way beyond the scope of this article. This setup will not be even close to production, but then again this article should merely serve as a gateway for your big data journey.

Here is the repo for this article:


A nicer view of the docker-compose file in the repository, which should roughly sketch out the architecture of the project.

Some highlights from the implementation in Docker

This was not quite as straightforward as initially thought, so here are some pointers from the development in case you want to customize this further. The focus of this article is not to give you a crash course in Docker and docker-compose either, so this section is brief and only highlights some places where you could get stuck.

  • If you want anything to work with Hue, you need to override Hue’s default configs by mounting the hue-overrides.ini file (you will find the file in the repo and the mount in docker-compose). Obvious, right? Wink, wink.
  • In hue-overrides.ini you should be looking at: [[database]] => this is the internal Hue DB, [[hdfs_clusters]] => connecting to HDFS to view files in Hue, [[yarn_clusters]] => setting up YARN and [beeswax] => connecting to Hive to run SQL queries in Hue.
  • If you do not have this line thrift_version=7 in hue-overrides.ini, Hue will refuse to connect to the Hive (=Thrift) Server because it is defaulting to a Hive Server version that is too high. This took hours.
  • If you use Hue’s default SQLite DB, you will get the message “database locked” when you try to connect to Hive => that is why there is a db-hue Postgres DB in the docker-compose file. Something about SQLite not being suitable for a multi-threaded environment as described here. Cloudera should work on their error messages…
  • POSTGRES_DB, POSTGRES_USER, POSTGRES_PASSWORD in hadoop-hive.env can be used with the official postgres Docker image to directly create the DB user when you spin up your container. Check.
  • Watch out for port 5432: do not expose it multiple times, since Postgres runs more than once in this project (once as a metastore for Hive and once as the DB for Hue)

tl;dr on the next steps

Ok. Short summary on what will happen next for the impatient engineers:

  1. Start Hue, Hive and your Hadoop nodes with docker-compose up
  2. Download a .csv dataset from Kaggle & convert it with the supplied Python script
  3. Import said Parquet file to HDFS via Hue & preview it to make sure things are OK
  4. Create empty Hive table with the Parquet file schema after inspecting the schema with the parquet-tools CLI tool
  5. Import file from HDFS into Hive’s table
  6. Run some SQL queries!

Starting the cluster and launching Hue with docker-compose

Well, since everything is already set up, just clone the repository to your computer and type docker-compose up in your terminal. That’s it. Then go to localhost:8888 and you should (after setting up the initial password for Hue) see this screen:

This screen shows you your Hadoop cluster’s HDFS while the sidebar shows you the DB tables in Hive’s metastore — both of which are empty in this case.

Uploading Parquet files on HDFS and previewing them in Hue

When trying to open some (quite a few) Parquet files in Hue you are going to get the following error message:

“Failed to read Parquet file”

and in your docker-compose logs:

NameError: global name ‘snappy’ is not defined

Turns out that Hue does not support the snappy compression that is the default for a lot of Parquet converting tools like pandas. There is no workaround for this except for recreating your Parquet files (if they are using snappy). Worst UX ever on Cloudera’s side…

In the GitHub repository you will find a parquet_converter.py, which uses pandas and specifies the compression as None, so it does not default to snappy, which would subsequently break Hue. This means you can take any dataset in .csv format, e.g. from Kaggle, and convert it to Parquet with the provided Python module.

At this point — if you are unafraid of the CLI — the best suggestion is for you to forget Hue and just use Hive and HDFS directly for your Parquet files. But if you stuck around with Hue like me, you can see below a UK Punctuality Statistics report from Kaggle that was converted with the Python script mentioned above and then uploaded as a file:

The File Browser in Hue when you click on the successfully imported Parquet file. You can access the File Browser from the dark sidebar on the left.

Creating a Hive table from your Parquet file and schema

After seeing that your data was properly imported, you can create your Hive table. For this you should run the following command in your command line in the folder where you converted your file (probably /your_github_clone/data):

parquet-tools schema 201801_Punctuality_Statistics_Full_Analysis.parquet

This will output the schema you need to create the table in Hue (UTF8 = string in Hue):

message schema {
  optional binary run_date (UTF8);
  optional int64 reporting_period;
  optional binary reporting_airport (UTF8);
  optional binary origin_destination_country (UTF8);
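If you prefer to script this translation rather than read the schema by eye, a hypothetical helper (not part of the repo; the type table below only covers the types that appear in this dataset, with UTF8 = string as noted above) could turn parquet-tools schema lines into Hive column definitions:

```python
# Map parquet-tools types to Hive column types. Only the types seen
# in this dataset are covered; extend as needed.
PARQUET_TO_HIVE = {
    "binary (UTF8)": "STRING",
    "int64": "BIGINT",
}

def to_hive_column(schema_line):
    # e.g. "optional binary run_date (UTF8);" -> "run_date STRING"
    parts = schema_line.strip().rstrip(";").split()
    parquet_type, name = parts[1], parts[2]
    if len(parts) > 3:          # logical type annotation like (UTF8)
        parquet_type += " " + parts[3]
    return f"{name} {PARQUET_TO_HIVE[parquet_type]}"

print(to_hive_column("optional binary run_date (UTF8);"))
# run_date STRING
print(to_hive_column("optional int64 reporting_period;"))
# reporting_period BIGINT
```

Feeding every schema line through this gives you the column list to paste into Hue’s table creation dialog.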

Time to create your table:

The preview of the new table you created. Go to the DB icon in the dark sidebar and create a new table manually with the schema as described above. Afterwards, click on the import button in the dark sidebar and import your Parquet file into the empty table. You should then see the above screen.

Running SQL queries

Running SQL queries was promised and it shall be delivered. The first icon in Hue’s sidebar is its query editor.

What if you wanted to find out all the flights from Poland that had an average delay of more than 10 minutes?

SELECT * FROM `2018_punctuality_statistics` WHERE origin_destination_country='POLAND' AND average_delay_mins>=10;

The auto-complete feature in the editor is fantastic, so even if you are a SQL novice, you should be able to play around with your data without too much effort.


Time to let go

Dear reader, sadly you have reached the end of this article. Please write in the comments below if you feel this journey should have a sequel. So, to recap: you learned how to run a Hadoop cluster with Hive for running SQL queries, all the while visualizing everything in Hue, with docker-compose. Not bad.

This is of course a very very simplified look into what is possible with Hadoop but chances are that you are just starting out in the field, so give yourself some time and build on this knowledge and infrastructure. Furthermore, there are great online courses out there that you can look into next.


Making big moves in Big Data with Hadoop, Hive, Parquet, Hue and Docker was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


Inequality: How to draw a Lorenz curve with SQL, BigQuery, and Data Studio

The top 0.1% of all Wikipedia pages earn 25% of the pageviews. The bottom 99% only get 42% of all the views. And the bottom 80% — only get 4%. This is just one example — in this post we’ll review how to get these numbers for this and any other dataset.

How can we get these numbers out of a SQL table?

Step 1: Your data, BigQuery, and SQL

First — I’ll define my dataset. In this case, it’s all the pageviews in the English Wikipedia during December 2019 — after removing all the special pages:

WITH wiki_prefixes AS (SELECT ['File:', 'Talk:', 'Template_talk:', 'Wikipedia:', 'Category:', 'User_talk:', 'Page:', 'Template:', 'Category_talk:'
, 'User:', 'Author:', 'Portal:', 'Wikipedia_talk:', 'Portal_talk:', 'File_talk:', 'Draft:', 'Help:', 'Draft_talk:', 'en:', 'Book_talk:', 'Module:', 'MOS:'
, 'Special:', 'Book:'] x)

, data AS (
  SELECT title, views
  FROM `fh-bigquery.wikipedia_extracts.201912_en_totals`
  WHERE title NOT IN ('-', 'Main_Page')
  AND (title NOT LIKE '%:%'
    OR REGEXP_EXTRACT(title, '[^:]*:') NOT IN UNNEST((SELECT(x) FROM wiki_prefixes)))
)

You need to give a consecutive row number to each row in this dataset, ordered by the # of pageviews:

SELECT title, views, ROW_NUMBER() OVER (ORDER BY views) rn
FROM data

Now you can use that row number to divide all the pages in 1,000 different buckets:

SELECT 1+fhoffa.x.int(rn/(SELECT (1+COUNT(*))/1000 FROM data)) bucket
  , COUNT(*) pages
  , SUM(views) views
  , STRING_AGG(title ORDER BY views DESC LIMIT 3) sample_titles
FROM (
  SELECT title, views, ROW_NUMBER() OVER (ORDER BY views) rn
  FROM data
)
GROUP BY 1

To get a bucket I have a subquery on the first line: (SELECT (1+COUNT(*))/1000 FROM data). That gets me a number based on the total number of rows in my dataset, and by dividing each row number by this value we get 1,000 different buckets, numbered 1 to 1,000. Each of these buckets will have the SUM(views) of its pages, and STRING_AGG(title ORDER BY views DESC LIMIT 3) serves to keep some sample titles to identify each bucket.
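The bucketing arithmetic can be sanity-checked in plain Python (a toy sketch of the formula only, not BigQuery; fhoffa.x.int is assumed to truncate like Python's int()):

```python
# Divide N ranked rows into 1,000 buckets the same way the query does
# with 1 + fhoffa.x.int(rn / ((1+COUNT(*))/1000)).
n_rows = 5000
bucket_width = (1 + n_rows) / 1000  # the subquery's (1+COUNT(*))/1000

# rn is the ROW_NUMBER() of each row, from 1 to n_rows.
buckets = [1 + int(rn / bucket_width) for rn in range((1), n_rows + 1)]

assert min(buckets) == 1     # lowest-viewed pages land in bucket 1
assert max(buckets) == 1000  # highest-viewed pages land in bucket 1000
```

The 1+COUNT(*) in the divisor nudges the ratio just below 1,000 for the last row, so the bucket numbers run exactly from 1 to 1,000.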

Now we will surround this query with a new one calculating the cumulative number of views while we go through the 1k buckets, and also the total # of pageviews from all these pages:

SELECT SUM(views) OVER(ORDER BY bucket) cum_views
, SUM(views) OVER() total_views

The way to get the cumulative vs the grand total for each row is the OVER(ORDER BY bucket) vs just OVER().

Next step: Take the cumulative number of views, and divide them by the total:

SELECT ROUND(100*cum_views/total_views,3) cum_percent
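In plain Python, the same cumulative-percent arithmetic looks like this (toy per-bucket numbers, just to illustrate the step):

```python
# Per-bucket view counts, already ordered ascending as in the query.
views_per_bucket = [1, 2, 3, 4, 90]

total_views = sum(views_per_bucket)  # the query's SUM(views) OVER()
cum_views = 0
cum_percent = []
for views in views_per_bucket:
    cum_views += views  # the query's SUM(views) OVER(ORDER BY bucket)
    cum_percent.append(round(100 * cum_views / total_views, 3))

print(cum_percent)  # [1.0, 3.0, 6.0, 10.0, 100.0]
```

Note how the last, most-viewed bucket alone jumps the cumulative share from 10% to 100%: that skew is exactly what the Lorenz curve visualizes.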

And now we have a cumulative percent we can chart in Data Studio. Let’s put everything together, creating a new table:

CREATE TABLE `wikipedia_extracts.201912_lorenz_curve` AS
WITH wiki_prefixes AS (SELECT ['File:', 'Talk:', 'Template_talk:', 'Wikipedia:', 'Category:', 'User_talk:', 'Page:', 'Template:', 'Category_talk:', 'User:', 'Author:', 'Portal:', 'Wikipedia_talk:', 'Portal_talk:', 'File_talk:', 'Draft:', 'Help:', 'Draft_talk:', 'en:', 'Book_talk:', 'Module:', 'MOS:', 'Special:', 'Book:'] x)

, data AS (
  SELECT title, views
  FROM `fh-bigquery.wikipedia_extracts.201912_en_totals`
  WHERE title NOT IN ('-', 'Main_Page')
  AND (title NOT LIKE '%:%'
    OR REGEXP_EXTRACT(title, '[^:]*:') NOT IN UNNEST((SELECT(x) FROM wiki_prefixes)))
)

SELECT ROUND(100*cum_views/total_views,3) cum_percent, *
FROM (
  SELECT SUM(views) OVER(ORDER BY bucket) cum_views, *, SUM(views) OVER() total_views
  FROM (
    SELECT 1+fhoffa.x.int(rn/(SELECT (1+COUNT(*))/1000 FROM data)) bucket
      , COUNT(*) pages
      , SUM(views) views
      , STRING_AGG(title ORDER BY views DESC LIMIT 3) sample_titles
    FROM (
      SELECT title, views, ROW_NUMBER() OVER (ORDER BY views) rn
      FROM data
    )
    GROUP BY 1
  )
)
-- 34.1 sec elapsed, 805.8 MB processed
Buckets #991 to #1000, the top 1%

Step 2: Visualizing in Data Studio

Go to the new table you just created in BigQuery, and follow these steps:

Explore with Data Studio → Save → Create new report and share

→ Add to report → Add a chart → Scatter → X: bucket, Y: cum_percent

And that’s how you can get a pretty Lorenz curve with Data Studio and BigQuery in a few steps:



Wikipedia just announced their 6 millionth article — and in this report we have counted more than 11 million pages. Note that this report is based on the pageviews Wikipedia reports hourly from their logs (loaded in BigQuery) — and the queries already remove most of the “special” pages I could find. Let me know if you know a better way to clean up these logs.

To balance this, note that I’m not counting any of the pages that got 0 views during December — we are looking here at pageview logs, so zeros won’t show up.

Next steps

  • To calculate the Gini coefficient with BigQuery, check this post from Evgeny Medvedev:

Calculating Gini Coefficient in BigQuery

Interactive: The top 2019 Wikipedia pages

Want more?

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.

Inequality: How to draw a Lorenz curve with SQL, BigQuery, and Data Studio was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
