Accurate Campaign Targeting Using Classification Algorithms

Stanford CS229 – Machine Learning project.

Algorithms used: Multivariate Linear Regression (MLR), Support Vector Machine (SVM), Tree Method and Random Forest, Neural Network.

Poster:

Paper:

Good Benchmarking Practices

To prepare for a project, I’ve recently been reading benchmarking tips online. I found many interesting articles. Some of their suggestions I had already practiced in two jobs: 1) as an IT consultant for an engineering company in San Jose, USA; 2) as a KPI researcher for an engineering consulting company in Shanghai. Others were new and inspiring. I’m sharing them here in the hope that they benefit colleagues.

1. What I’ve done in the past that turned out to be beneficial:

Run a pilot project initially, perhaps with another branch/site of your organization, so that you can iron out any wrinkles before you engage with an external partner.
Use a questionnaire. Seek a balance of qualitative and quantitative information when designing it and remember to test it first internally. Only ask questions that you’re willing to answer yourself.
Get the planning right, including research. Careful preparation at strategic and operational levels is vital before you move into implementation. As well as doing benchmarking right, it’s vital to benchmark the right things. This means making sure your project addresses one or more of your organisation’s broader business goals. Key outcomes will include increased customer satisfaction, shorter cycle time, improved quality, reduced waste, higher productivity and greater cost savings.

2. What I failed to do:

Pick the right benchmarking partners. You need to ensure they are as serious about benchmarking as you are and that their process is comparable. It’s hard to get someone else to fill in a questionnaire when it’s not part of their job. So it is better to invest time in gaining their co-operation and help; otherwise it’s highly likely that you will end up collecting invalid data.

3. What I found inspiring and will do in future:

Make sure you have the right project team in place to deliver your benchmarking. They need to be highly motivated and open to change with good communication skills – most of all they have to be credible individuals. Spend time gaining buy-in from those internal customers who stand to benefit most from your benchmarking project.
Integrate your benchmarking with your other business improvement efforts. Its purpose is to measure, compare and improve so make sure it ties in with your other change initiatives.

Source of reference: http://www.thecqi.org/Knowledge-Hub/Quality-express/archives/Quality-updates/benchmarking-tips/

Some Cool Single-Line Functions in Python

I have recently done many exercises in Python. I have to say it’s really a beautiful language! Lots of cool stuff can be achieved with a single-line function.

Here are some examples. I will add more when I come across them. Please add comments if you think there’s even a better way!

1. Write a function that takes a list and returns a dictionary whose keys are the elements of the list and whose values are the number of occurrences of each element in the list:

def count_list(l): return { key: l.count(key) for key in l }
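For long lists, the comprehension above rescans the whole list once per element. A minimal alternative sketch, assuming the elements are hashable, uses collections.Counter from the standard library. The name count_list_fast is made up for this example:

```python
from collections import Counter

def count_list_fast(items):
    # Counter tallies every element in a single pass over the list
    return dict(Counter(items))

print(count_list_fast(['a', 'b', 'a', 'c', 'a']))
```

Both versions return the same dictionary; the difference only starts to matter when the list gets large.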

2. Reverse look-up: Write a function that takes a dictionary and a value, and returns the key associated with this value.

def reverse_map(d, v): return {value: key for key, value in d.items()}[v]
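The version above builds a whole inverted dictionary on every call, and if several keys share the same value, the last one wins. A possible variant, sketched here under the hypothetical name reverse_lookup, stops at the first match and returns None when the value is absent (that default is my own choice, not part of the exercise):

```python
def reverse_lookup(d, v, default=None):
    # next() stops at the first key whose value equals v
    return next((key for key, value in d.items() if value == v), default)

ages = {'ann': 31, 'bob': 25, 'cy': 31}
print(reverse_lookup(ages, 25))
print(reverse_lookup(ages, 99))
```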

3. Print the numbers 1 to 100 that are divisible by 5 but not by 3:

Method 1:

x1 = range(1, 101)

print(list(filter(lambda x: x % 5 == 0 and x % 3 != 0, x1)))

Method 2:

print([i for i in range(1, 101) if i % 5 == 0 and i % 3 != 0])

4. Loop over elements and indexes of a list, print them in a given form:

myList = [1, 2, 4]

for index, elem in enumerate(myList): print('{0}) {1}'.format(index, elem))

result:

0) 1

1) 2

2) 4

5. Nearest neighbor – Write a function that takes a NumPy array a and a value a0 and finds the element of a that is closest to a0. The function should return the closest value, not its index:

import numpy as np

def find_nearest(a, a0):
    return a.flat[np.abs(a - a0).argmin()]
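If NumPy is not at hand, the same idea also fits in one line of plain Python. This is only a sketch: find_nearest_plain is a name made up for this example, and ties go to whichever element comes first.

```python
def find_nearest_plain(values, z):
    # min() with a key picks the element with the smallest distance to z
    return min(values, key=lambda x: abs(x - z))

print(find_nearest_plain([1.0, 4.5, 7.2, 10.0], 6.0))
```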

More to come!

Uncovering Best Practices for Data-Driven Facilities Energy Management

Stanford CIFE 2014 Funding Candidate Proposal

Energy costs represent about 20% of total operating expenditures for office buildings.

Our case studies of more than fifty buildings have demonstrated that savings of up to 54% are achievable with very little capital investment and little to no distraction for building occupants. As with any mechanical system, the performance of building systems declines with time, and they require regular maintenance to work as expected. However, there is no established literature on optimal maintenance strategies based on real performance data from real-world projects. Using performance and asset data from our industry partners, we propose to examine the impacts of building systems maintenance on building performance. Additionally, we will identify processes by which the design and construction phases of the project can contribute to implementing optimal facility maintenance strategies.

As a goal of the study, we will compile a set of best practices for asset management and building systems management to maximize occupant comfort and energy efficiency while minimizing costs of operating the system.

Graph Database Technologies: Neo4j and Cypher Query Language

I. Neo4j

Neo4j is an open-source graph database implemented in Java. Its data is stored as graphs rather than in tables, which allows more efficient visualization and analysis of graph-based relationships. Neo4j was first created in Sweden in 2007 and is now widely used by many customer-based websites (see http://neo4j.com/customers/ for a list).

(picture from: http://www.slideshare.net/emileifrem/an-intro-to-neo4j-and-some-use-cases-jfokus-2011)

Graphs are useful in many cases:

Modeling a social network with nodes and relationships, and properties on both.
Calculating the shortest path between two nodes. For example, if websites are nodes and the links between them are edges, the calculation tells us how many links we need to follow to get from one website to another.
Finding nodes with similar activity patterns. For example, an online dating website can use this to find people with the same in-degree and out-degree for certain activities.
Matching complicated relationships. For example, we may want to find all YouTube users with a specific registration date who were active on a certain video channel that has more than 10M subscribers.
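To make the shortest-path case concrete, here is a small sketch in plain Python rather than in Neo4j. It assumes a toy graph stored as a dictionary of adjacency lists; the site names and the function shortest_hops are invented for illustration, and a graph database would do this work for you.

```python
from collections import deque

def shortest_hops(graph, start, goal):
    """Return the number of links on a shortest path, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None

# Hypothetical websites (nodes) and the links between them (directed edges)
links = {
    'a.example': ['b.example', 'c.example'],
    'b.example': ['d.example'],
    'c.example': ['d.example'],
    'd.example': ['e.example'],
}
print(shortest_hops(links, 'a.example', 'e.example'))
```

Breadth-first search visits nodes in order of distance from the start, so the first time it reaches the goal it has found a shortest path.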

(picture from Neo4j 8/14/2014 webinar)

For more use cases, please visit http://neo4j.com/use-cases/

Advantages of Neo4j:

In a traditional SQL database, the queries above could take a lot of code and a lot of memory to compute. With Neo4j and the Cypher language, the same results can be achieved with very little code and with optimized performance.

If you know a little about computer systems, you’ll also see that Neo4j is a system-friendly and optimized product.

In Neo4j, nodes and relationships are stored in separate files. Each node is stored as a fixed-length record with a pointer to its first property and its first relationship. Each relationship is stored as a fixed-length record with pointers to the previous and next relationships. These fixed-length records allow fast look-up. (from Stanford CS145 summer class)
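The reason fixed-length records make look-up fast is that the position of a record follows directly from its ID, with no index search. The sketch below illustrates that arithmetic only; the 16-byte record layout is an assumption for the example and not Neo4j's actual on-disk format.

```python
import struct

# Hypothetical layout: in-use flag (1 byte), 3 padding bytes,
# first-relationship ID (4 bytes), first-property ID (8 bytes)
RECORD = struct.Struct('<B3xIQ')  # 16 bytes per record

def write_node(store, node_id, first_rel, first_prop):
    offset = node_id * RECORD.size          # position follows from the ID
    RECORD.pack_into(store, offset, 1, first_rel, first_prop)

def read_node(store, node_id):
    offset = node_id * RECORD.size          # direct jump, no scan
    in_use, first_rel, first_prop = RECORD.unpack_from(store, offset)
    return {'in_use': bool(in_use), 'first_rel': first_rel, 'first_prop': first_prop}

store = bytearray(RECORD.size * 100)        # room for 100 node records
write_node(store, 42, first_rel=7, first_prop=1234)
print(read_node(store, 42))
```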

<– Neo4j server web interface

II. Cypher

So what is Cypher?

Cypher is a declarative graph query language for Neo4j. It’s a SQL-like language, but allows for more expressive and efficient querying and updating of the graph store.

The syntax is very intuitive:

 MATCH (actor:Person { name:'Charlie Sheen' })-[:ACTED_IN]-(movie:Movie)
   RETURN movie

The query above finds all the movies in which a Person node named “Charlie Sheen” acted.

The example below finds pairs of distinct nodes that are connected by a path of at most three relationships, and returns the length of the shortest such path:

MATCH p = shortestPath((a)-[*..3]-(b))
WHERE a <> b
RETURN a.name, b.name, length(p)

Does this remind you of LinkedIn?

For more of Cypher’s powerful functions, refer to the reference card: bit.ly/cypher-refcard

There is also an advanced track where people do more powerful and creative things. Below is a picture from a Graph Database meetup in Chicago, where the presenter showed how to hook up to one of the social networks (Facebook, Twitter, LinkedIn) and import profiles and relationships into your graph.

Looking forward to more exciting developments in graph databases.

RCharts example

rCharts is a wonderful tool for data analysis and visualization. A month ago I tried all the examples in the rCharts gallery. Here is an example I built:

Adult obesity rate map with hover showing state names.

I can change the color scale and the data input, and link data from SQL Server to R using .csv files. Check out more examples on the gallery website: http://rcharts.io/gallery/

D3 Visualization Example

Last week in my Data Science internship, I created some visualizations with D3. I used two open source libraries from GitHub, “Circular Heat Chart” and “stackpercent”. They are wonderful for presenting multi-dimensional data!

The chart above represents three time blocks in a day:

Midnight – 4am, 4am – 8am, 8am-noon

The labels are very easy to update, so there is a lot of flexibility to use this for monitoring data over any time periods. Different colors represent different statuses, which you can define however you like.

Below are two sample charts I made:

Documentation:

1. Purpose: Display the coverage quota for each of a single day’s six shifts.

2. Tools: D3, JS, CSS

3. Variable: four colors representing “Committed”, “One-time Covered”, “One-time Cancel”, “Never Filled”.

The color settings are stored as variables in “day.js”

4. Input requirement: pass in a variable made up of elements that correspond to the color variable names. For example:

var shifData = [
  cover, cover, cover, cover, cover, cover, // The innermost ring, clockwise
  cover, cover, one_time, cover, cover, cover, // The second ring from the center, clockwise
  one_time, cover, one_time, cover, cover, cover,
  cancel, one_time, cancel, cover, cover, one_time,
  cancel, one_time, never, one_time, one_time, cancel,
  never, cancel, never, cancel, never, never // The outermost ring, clockwise
];

Documentation:

1. Purpose: Display the coverage quota for each of a single day’s six shifts.

2. Tools: JS, CSS, open source plugin (https://github.com/skeleton9/flot.stackpercent)

3. Variable:

label: representing a shift coverage state

data-width: the amount/quota of this particular state

data-height: the order of shift, i.e. 1 for 1st shift, 2 for 2nd shift

color: visualization for this shift coverage state

4. Input requirement: pass in a variable made up of four objects, each with “label”, “data”, and “color”. For example:

var data = [
  {"label": "Committed", "data": [[12, 1], [10, 2], [7, 3], [7, 4], [7, 5], [7, 6]], "color": "green"}, // [12, 1] stands for 12 committed doctors on the 1st shift
  {"label": "Committed(leave)", "data": [[11, 1], [9, 2], [8, 3], [7, 4], [7, 5], [7, 6]], "color": "#83BA4F"},
  {"label": "Cancel(one-time)", "data": [[10, 1], [6, 2], [5, 3], [7, 4], [7, 5], [7, 6]], "color": "#E78800"},
  {"label": "No Fill Ever", "data": [[10, 1], [6, 2], [5, 3], [7, 4], [7, 5], [7, 6]], "color": "red"}
];

Python Web Application – AuctionBase (adding bids like eBay!)

Over the past three weeks I built a web application using Python (web.py and jinja2) and SQLite. It’s a website where you can add bids to open items, browse items of interest, and search item details (start date, end date, current price, seller, number of bids, description, etc.)

The project has three milestones:

1. Database preparation

Designing the database table schemas so that the database takes as little storage space as possible while staying well organized for managing, updating, deleting, and searching;
Writing a Python script to convert the JSON files to data files (.dat);
Loading the data into the SQLite database and writing queries to check how the load performed;
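As an illustration of the conversion step, the sketch below turns JSON item records into pipe-delimited lines suitable for a bulk load. The field names (ItemID, Name, Currently) and the sample records are assumptions made for this example, not the project's actual schema.

```python
import json

def items_to_dat_lines(json_text, columns=('ItemID', 'Name', 'Currently')):
    """Convert a JSON array of items into pipe-delimited rows."""
    lines = []
    for item in json.loads(json_text):
        # Missing fields become NULL; pipes inside values would break the row
        fields = [str(item.get(col, 'NULL')).replace('|', ' ') for col in columns]
        lines.append('|'.join(fields))
    return lines

sample = '[{"ItemID": 1, "Name": "Old clock", "Currently": "12.50"}, {"ItemID": 2, "Name": "Lamp"}]'
for line in items_to_dat_lines(sample):
    print(line)
```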

2. Setting integrity constraints

Writing SQL constraints and triggers to ensure the integrity of the data in the database;
Testing the triggers and loading them into the database;
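Here is a minimal sketch of what such triggers could look like, run through Python's built-in sqlite3 module. The tables, columns, and the rule itself (a new bid must exceed the item's current price) are assumptions for illustration, not the project's real schema.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE Items (ItemID INTEGER PRIMARY KEY, Currently REAL NOT NULL);
CREATE TABLE Bids  (ItemID INTEGER REFERENCES Items(ItemID), Amount REAL NOT NULL);

-- Reject any bid that does not beat the current price
CREATE TRIGGER bid_must_raise_price
BEFORE INSERT ON Bids
FOR EACH ROW
WHEN NEW.Amount <= (SELECT Currently FROM Items WHERE ItemID = NEW.ItemID)
BEGIN
    SELECT RAISE(ABORT, 'Bid must exceed the current price');
END;

-- Keep the current price in step with accepted bids
CREATE TRIGGER bid_updates_price
AFTER INSERT ON Bids
FOR EACH ROW
BEGIN
    UPDATE Items SET Currently = NEW.Amount WHERE ItemID = NEW.ItemID;
END;
""")

conn.execute("INSERT INTO Items VALUES (1, 10.0)")
conn.execute("INSERT INTO Bids VALUES (1, 15.0)")      # accepted

try:
    conn.execute("INSERT INTO Bids VALUES (1, 12.0)")  # too low, trigger aborts
    rejected = False
except sqlite3.DatabaseError as err:
    rejected = True
    print('Rejected:', err)

current_price = conn.execute("SELECT Currently FROM Items WHERE ItemID = 1").fetchone()[0]
print(current_price)
```

Putting the rule in a trigger means it holds no matter which script or page inserts the bid.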

3. Web application and interface

Writing web.py scripts and jinja2 templates to set up the user interface: input values, select the current time, search by item ID, browse items of interest based on criteria, and add bids;
Using transactions to keep the application robust;
Error management: when an error occurs on a search or while adding a bid, it is handled cleanly, an error message is shown, and the user can keep interacting with the website;
Testing the constraints and triggers to ensure normal bidding behavior.
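The transaction and error-handling pattern described above might look roughly like this. It is a sketch that uses Python's sqlite3 module directly instead of the project's web.py code; the add_bid function, the table layout, and the error messages are all hypothetical.

```python
import sqlite3

def add_bid(conn, item_id, user_id, amount):
    """Try to record a bid; return (ok, message) instead of raising."""
    try:
        # 'with conn' commits on success and rolls back if an exception escapes
        with conn:
            row = conn.execute(
                "SELECT Currently FROM Items WHERE ItemID = ?", (item_id,)
            ).fetchone()
            if row is None:
                raise ValueError('No such item')
            if amount <= row[0]:
                raise ValueError('Bid must exceed the current price')
            conn.execute(
                "INSERT INTO Bids (ItemID, UserID, Amount) VALUES (?, ?, ?)",
                (item_id, user_id, amount),
            )
            conn.execute(
                "UPDATE Items SET Currently = ? WHERE ItemID = ?", (amount, item_id)
            )
        return True, 'Bid added'
    except (ValueError, sqlite3.DatabaseError) as err:
        return False, str(err)

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE Items (ItemID INTEGER PRIMARY KEY, Currently REAL)")
conn.execute("CREATE TABLE Bids (ItemID INTEGER, UserID TEXT, Amount REAL)")
conn.execute("INSERT INTO Items VALUES (1, 10.0)")
conn.commit()

print(add_bid(conn, 1, 'alice', 12.0))
print(add_bid(conn, 1, 'bob', 11.0))
print(add_bid(conn, 99, 'carol', 5.0))
```

Because the function returns a message rather than letting the exception propagate, a page handler can show the error and let the user try again.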

Below are some screenshots to show the project progress:

–Before organizing the data, a mess for human eyes–

–Many files and scripts were created to operate on the data–

–Magic happens after all the programming tools–

I love Computer Science!

Excel data model & visualization – 02. Yearly Trend Reporting Dashboard

Includes:
1) key findings on important factors by year;
2) a stacked column graph showing the cumulative growth trend;
3) a pie chart showing the contribution of different sectors;
4) levels for different states (although not geographical, it serves as a much better tool for comparison);
5) in addition to the cumulative amount, a bar chart of the number of units in each year (the number of LEED projects each year, to serve as a reference for that year’s total savings level);
6) a stacked column chart showing two different contributors’ yearly contributions.

This is a summary of the whole model document, presenting succinct information with graphs. Users can check the details in the previous tabs, which I will cover in future posts of my Excel series.

Excel data model & visualization – 01. Sum Table and Interactive Growth Trend

I built an Excel model for USGBC in 2013 to calculate the water savings of LEED buildings. The model was built entirely from scratch, based on 17,000 data entries of LEED certified buildings. All the calculations and visualizations were done using Excel functions and graph tools.

data visualization work example

It has information about
– each year’s saving contribution to the current
– each year’s cumulative level
– total savings
– validation (summing the total in two different directions and checking that the results match)

interactive graph

Users can choose a growth rate and see the cumulative water saving model update automatically. Growth rate is defined as the number of new LEED buildings in a year compared to the number of new LEED buildings in the previous year.

Here I chose a 20% growth rate for 2014 and a 30% rate for 2015. A sharp increase in 2015 can be seen in the graph. It is due to the growing number of 2015 projects, while the contribution from projects of previous years remains at about the same level (with ~1% degradation).
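The logic behind the interactive graph can be sketched outside Excel. The snippet below is an illustration resting on several assumptions: the starting number of buildings, the savings per building, and the exact way the ~1% degradation is applied are placeholders I chose, not the figures or formulas in the USGBC model.

```python
def project_cumulative_savings(base_new_buildings, growth_rates,
                               saving_per_building=1.0, degradation=0.01):
    """Return the total yearly savings for each projected year.

    Each growth rate is (new buildings this year / new buildings last year) - 1.
    """
    cohorts = [base_new_buildings * saving_per_building]  # savings by build year
    totals = []
    new_buildings = base_new_buildings
    for rate in growth_rates:
        # Older cohorts keep contributing, slightly degraded each year
        cohorts = [c * (1 - degradation) for c in cohorts]
        new_buildings *= (1 + rate)
        cohorts.append(new_buildings * saving_per_building)
        totals.append(sum(cohorts))
    return totals

# 20% growth for the first projected year, 30% for the second
print(project_cumulative_savings(100, [0.20, 0.30]))
```

Each year's total is the sum of all earlier cohorts, which stay nearly flat, plus the new cohort, so a higher growth rate shows up as a steeper step in the cumulative curve.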

The 2020 graph uses the same method, with the growth rates all set to 5% from 2014 to 2020.

Data Mining, Business Intelligence
