Easy way to make a pivot table

I often work with databases and sometimes I need to transform raw data into pivot table. Many people doesn’t know it might be done by using pivot_table() from pandas library.

Let’s say we have a table where id and key1_N are keys. And we want to transform it into a new form like this:

pivot_tables

How to do it:

import pandas as pd

df = pd.DataFrame({
 'id': [1,1,1,2,2,2],
 'key1_N': [1,2,3,1,2,3],
 'var1': [3,4,8,3,4,8],
 'var2': [5,1,6,6,1,1]
 })
df = df.pivot_table(index='id', columns='key1_N', values=['var1', 'var2'])

After using this operation we get a table with two-level columns. But we can change the value of df.columns:

import itertools 
df.columns = list([x[0] + '_' + str(x[1]) for x in itertools.product(df.columns.levels[0], df.columns.levels[1])])

I’ve moved to the US

My wife and I moved to the US a few weeks ago and now we live in San Jose, CA. It’s an amazing place, where I really love to be. Now I explore local life and looking for a job in machine learning field. I believe I can be very valuable for any tech company in this area. So if you read this post and you have any relevant positions write to me please 🙂

A few pictures from our new californian life:

IMG_8139

IMG_8171

IMG_8155

HiveQL: Partitioned Tables

I work on a project where we have a really huge dataset and we use a Hadoop cluster. I often need to select data from a specific table where one of columns is date. If table is really huge it takes a lot of time to make queries like this: “select * from table_name where day=’2015-05-01′”.

It’s better to use partitioned tables in this case (HiveQL querie example):

CREATE TABLE schema.table_A (field1 int, field2 string, field3 float)
PARTITIONED BY (day) stored as orc

Selecting data from this type of tables is much faster, because each day is actually stored in a separate file on a disk.

If we want to insert data into this table with overwriting we should use a query like this:

INSERT OVERWRITE TABLE schema.table_A PARTITION (day = '2015-05-01')
select * from table_B

 

 

How to find prime numbers

A simple function that allow to find first N prime numbers:

def primes(n):
    
    numbers = set(range(2, n))
    
    for j in range(2, int(np.sqrt(n))):
        numbers -= set(range(2*j, n, j))
    
    return sorted(list(numbers))

print(primes(n=100))

 

How to get a list of dates

Let’s say we have two different dates and we want to make a list of all dates that are between them. Here is an easy way to do it:

import datetime

date1 = datetime.datetime.strptime("2015-04-05", "%Y-%m-%d").date()
date2 = datetime.datetime.strptime("2015-06-30", "%Y-%m-%d").date()

DATES = [str(date1+datetime.timedelta(days=i)) for i in range((date2 - date1).days + 1) if date1+datetime.timedelta(days=i) <= date2]