Monday, February 20, 2012

Georg Cantor and the foundations of Set Theory.


Since I am on the topic of Venn diagrams, I decided to review the history of "set theory". Given that a Venn diagram is just a graphical representation of sets of data, producing an accurate Venn diagram requires understanding the underlying data. While the sets presented in Venn diagrams are normally finite, the theory of sets, and mathematicians' interest in developing it, grew out of debates over the possibility of infinity. The German mathematician Georg Cantor founded set theory. Prior to Cantor, mathematicians and philosophers worked with collections of objects but generally accepted only limited (finite) collections. Cantor's work showed that a set can contain an infinite amount of data. He also proved that not all infinite sets are equinumerous: if two sets are presented and both are infinite, it is still possible that they do not have the same amount of data. Part of his proof showed that there are infinitely many real numbers and infinitely many natural numbers, but that there are more real numbers than natural numbers. His proofs were controversial among the philosophers of his day, but later mathematicians embraced his work and developed it into the modern concept of set theory. Thus what is normally taught as a beginner's skill in statistics actually has its roots in a topic that changed philosophy and the worldview of mathematics.
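In modern notation (my paraphrase of the standard statement, not Cantor's original wording), the comparison of these two infinite sets is usually written as

\[ |\mathbb{N}| = \aleph_0 < 2^{\aleph_0} = |\mathbb{R}| \]

that is, the natural numbers are countably infinite while the real numbers form a strictly larger, uncountable set.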

For deeper information on Georg Cantor, see the Wikipedia article, or read a biography.

MIT OCW Exercise for Ch 1.

I started on the first exercise for the MIT book. Since part of my goal with this class is to produce the answers using GNU R, producing this first answer was not very time efficient. I expect the time to improve as I progress along the learning curve of the VennDiagram package.

Suppose that A ⊂ B. Show that Bᶜ ⊂ Aᶜ.
In the diagram, Bᶜ is the white area, while Aᶜ is both the white and yellow areas; since the white area lies entirely within the white-plus-yellow area, Bᶜ ⊂ Aᶜ.
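For reference, here is a minimal sketch of how such a nested diagram might be drawn with the VennDiagram package (the set sizes and colours are assumptions for illustration, not the exact values behind my figure):

library(VennDiagram)
library(grid)

grid.newpage()
# A is drawn entirely inside B by making the overlap equal to the size of A.
venn.plot <- draw.pairwise.venn(
    area1      = 100,              # assumed size of B
    area2      = 25,               # assumed size of A, a subset of B
    cross.area = 25,               # overlap = |A|, so A sits wholly inside B
    category   = c("B", "A"),
    fill       = c("yellow", "white"),
    lty        = "solid",
    cex        = 1.5,
    cat.cex    = 1.5
)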



I would like to thank DWin from StackExchange for his guidance on the VennDiagram package.

Improvements I should consider for this diagram.
  1. Add the label for the universal set U (which would equal the area for Bᶜ); one possible approach for this and item 2 is sketched below.
  2. Fill in the universal set U with a color other than white.
  3. Remove the number values as they are not important for this exercise.
Of course I am open to your comments to improve this diagram.
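For items 1 and 2, a rough, untested sketch (again with assumed set sizes) would be to shade the page and add a "U" label with grid before drawing the circles:

library(VennDiagram)
library(grid)

grid.newpage()
grid.rect(gp = gpar(fill = "lightgrey", col = NA))        # colour the universal set U
grid.text("U", x = 0.05, y = 0.95, gp = gpar(cex = 1.5))  # label U in a corner
# Same nested diagram as above, drawn on top of the shaded background.
venn.plot <- draw.pairwise.venn(
    area1 = 100, area2 = 25, cross.area = 25,
    category = c("B", "A"),
    fill = c("yellow", "white")
)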

Monday, February 13, 2012

MIT Open Courseware statistics class, and GNU R

It has been a long time since I worked on any serious statistical problems. To update my skills in my own time, I am participating in the MIT OpenCourseWare undergraduate-level statistics class, but with a twist: I will use GNU R and its packages for the exercises in the book. I have been using GNU R on and off for my own projects, so this gives me a better opportunity to build skills with GNU R while strengthening my statistical skills.

So the first chapter in the course book covers Venn diagrams. While simple Venn diagrams are easy to produce, readable complex ones require some skill in art or the assistance of a program designed to make them. GNU R has two popular packages for this task: venneuler and VennDiagram.

venneuler is good for getting started with semi-simple diagrams. You will note that the name mentions both Venn and Euler diagrams. They are similar diagrams whose main difference is how regions with no data are handled when three or more circles overlap. BMC Bioinformatics posted an example comparing the two diagrams, shown below.

Note that figure A) (the Euler diagram) minimizes overlap, so the region with no shared data does not appear. Figure B) (the traditional Venn diagram) shows all possible overlaps, so it must display a region of zero or null data where the three sets share no common elements.

The GNU R code below for a simple Venn diagram
require(venneuler)
# Two sets: A and B each contain 200 items, 100 of which are shared.
v <- venneuler(c(A = 200, B = 200, "A&B" = 100))
v$labels <- c("Green", "Blue")   # relabel the two circles
plot(v)
text(0.5, 0.6, "my text here")   # add free text at plot coordinates
 
produces the following Venn diagram.
While venneuler is good for simple Venn diagrams, the VennDiagram package gives the user greater control over the diagrams. Here is an example of a complex Venn diagram from BMC Bioinformatics.
The code for this diagram follows:
library(VennDiagram);

# Four sets (I to IV) are defined as ranges of integer IDs; the overlaps among
# the ranges determine the size of each intersection region in the diagram.
venn.diagram(
    x = list(
        I = c(1:60, 61:105, 106:140, 141:160, 166:175, 176:180, 181:205, 206:220),
        IV = c(531:605, 476:530, 336:375, 376:405, 181:205, 206:220, 166:175, 176:180),
        II = c(61:105, 106:140, 181:205, 206:220, 221:285, 286:335, 336:375, 376:405),
        III = c(406:475, 286:335, 106:140, 141:160, 166:175, 181:205, 336:375, 476:530)
        ),
    filename = "1D-quadruple_Venn.tiff",
    col = "black",
    lty = "dotted",
    lwd = 4,
    fill = c("cornflowerblue", "green", "yellow", "darkorchid1"),
    alpha = 0.50,
    label.col = c("orange", "white", "darkorchid4", "white", "white", "white", "white", "white", "darkblue", "white", "white", "white", "white", "darkgreen", "white"),
    cex = 2.5,
    fontfamily = "serif",
    fontface = "bold",
    cat.col = c("darkblue", "darkgreen", "orange", "darkorchid4"),
    cat.cex = 2.5,
    cat.fontfamily = "serif"
    );


Reviewing the sample code and the available documentation for both packages, I find that the VennDiagram package contains a larger library of statements for granular control of the diagram. As I perform the exercises in the statistics book, I will attempt a mix of diagrams from each package and provide examples of what I learned from the chapter and from using both GNU R packages.

Tuesday, June 28, 2011

Differences in Linear Programming models

With my mind back on The Science of Decision Making book, I have two different ways to build the same linear programming model. The book describes how to do so using Excel, so below is a screenshot of the model in Excel before it is solved. Cells E3:E8 use the SUMPRODUCT function on the related values in columns B, C, and D.

The solver seeks to maximize E8, subject to
$B$9:$D$9 >= 0
$E$3:$E$7 <= $G$3:$G$7
with the changing cells $B$9:$D$9. 
This results in the answer below.
Produce 
20 of S
30 of F
0 of L

Now, to solve the same problem using the GNU Linear Programming Kit (GLPK), the model is built as follows.
--------
# This finds the optimal solution for maximizing the RV plant's profit
#

/* Decision variables */
var x1 >=0;  /* Standard */
var x2 >=0;  /* Fancy */
var x3 >=0;  /* Luxury */

/* Objective function */
maximize z: 840*x1 + 1120*x2 + 1200*x3;

/* Constraints */
s.t. Engine_Shop : 3*x1 + 2*x2 + 1*x3 <= 120;
s.t. Body_Shop   : 1*x1 + 2*x2 + 3*x3 <= 80;
s.t. Standard_Fin : 2*x1 + 0*x2 + 0*x3 <= 96;
s.t. Fancy_Fin : 0*x1 + 3*x2 + 0*x3 <= 102;
s.t. Luxury_Fin : 0*x1 + 0*x2 + 2*x3 <= 40;

end;
--------
While spreadsheet fans may find this slightly harder to follow, the format in this tool is closer to how a student defines a linear programming problem in class, only slightly more complex given the scripting variables. GLPK produces the following report after running "glpsol -m glpk_Science_of_decision_Ch001.mod -o test.sol":

--------
Problem:    glpk_Science_of_decision_Ch001
Rows:       6
Columns:    3
Non-zeros:  12
Status:     OPTIMAL
Objective:  z = 50400 (MAXimum)

   No.   Row name   St   Activity     Lower bound   Upper bound    Marginal
------ ------------ -- ------------- ------------- ------------- -------------
     1 z            B          50400                             
     2 Engine_Shop  NU           120                         120           140
     3 Body_Shop    NU            80                          80           420
     4 Standard_Fin B             40                          96 
     5 Fancy_Fin    B             90                         102 
     6 Luxury_Fin   B              0                          40 

   No. Column name  St   Activity     Lower bound   Upper bound    Marginal
------ ------------ -- ------------- ------------- ------------- -------------
     1 x1           B             20             0               
     2 x2           B             30             0               
     3 x3           NL             0             0                        -200

Karush-Kuhn-Tucker optimality conditions:

KKT.PE: max.abs.err. = 0.00e+000 on row 0
        max.rel.err. = 0.00e+000 on row 0
        High quality

KKT.PB: max.abs.err. = 0.00e+000 on row 0
        max.rel.err. = 0.00e+000 on row 0
        High quality

KKT.DE: max.abs.err. = 1.14e-013 on column 1
        max.rel.err. = 1.35e-016 on column 1
        High quality

KKT.DB: max.abs.err. = 0.00e+000 on row 0
        max.rel.err. = 0.00e+000 on row 0
        High quality

End of output
--------


While the report contains much more information, we can still see the same results.
The company should produce
20 of the standard
30 of the fancy
0 of the luxury
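As a side note, the same model can also be checked from GNU R; below is a minimal sketch using the lpSolve package (my own cross-check with assumed package choice; the book uses Excel and this post uses GLPK):

# Minimal GNU R cross-check of the same model using the lpSolve package.
library(lpSolve)

obj <- c(840, 1120, 1200)          # profit per Standard, Fancy, Luxury unit

con <- matrix(c(3, 2, 1,           # Engine_Shop hours
                1, 2, 3,           # Body_Shop hours
                2, 0, 0,           # Standard_Fin hours
                0, 3, 0,           # Fancy_Fin hours
                0, 0, 2),          # Luxury_Fin hours
              nrow = 5, byrow = TRUE)
dir <- rep("<=", 5)
rhs <- c(120, 80, 96, 102, 40)     # shop capacities from the model above

sol <- lp("max", obj, con, dir, rhs)
sol$objval     # should match the GLPK objective of 50400
sol$solution   # should match the production plan of 20, 30, 0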

The Google Docs spreadsheet looks almost the same as Excel for the purposes of this post, so a screenshot of it is excluded. The only difference between the two for this example is how the constraints are set up. See the previous post for the details.

Resolved Google Docs solver issue

Well, it seems the Solver in the Google Docs spreadsheet can produce the same answer as Excel.
The issue comes down to how you enter the constraints. Excel can accept constraints as a range, such as $E$3:$E$7 <= $G$3:$G$7, but in a Google Docs spreadsheet a range cannot be used. Each constraining pair must be entered as a separate constraint.
Hence
E3 <= G3
E4 <= G4
E5 <= G5
E6 <= G6
E7 <= G7

So as I make my way through the book and try the examples in both Excel and Google Docs, I am sure I will find other issues.

Saturday, June 11, 2011

Using Solver tool in Excel and Google Docs

While I was working on the first problem in The Science of Decision Making book, it dawned on me that not only does Excel have a solver tool for linear programming problems, but the Google Docs spreadsheet also has a solver. So I tried the problem in both tools. Unfortunately, I got different results. I repeated the same steps over and over again to check whether I had made a mistake. So far the results do not reconcile.

Results in Excel (which match the results in the book)


So I will email Google to find out if there is something I am missing.   

Science of Decision Making book and the many tools to optimize a decision

One of the great benefits of blogging is stating a goal and using your public announcement to motivate you to finish it. If the writer (me) does not accomplish the goal, it is embarrassing and the blog becomes rather boring. Hence my first post is about my goal to complete the work within the book The Science of Decision Making: A Problem-Based Approach Using Excel. Only in my case, I am looking to work the examples in both Excel and the GNU Linear Programming Kit.

So why do I want to do this? In my past experience with computer security, I found that a lot of peers, journals, and experts talked about risk, but few could talk about risk in a manner similar to an insurance actuary. Many expert technology companies and government institutions have a difficult time determining risk and making optimal decisions. I do not claim that I am any better at making decisions than the management of these organizations at this time. My interest is in building skills that could give me the ability to assess security risks in a pattern similar to an insurance company's. I am hoping that these skills will give me an advantage in any complex decision I need to make. At the very least, I can use the book's content to build decision-making skills in a way that is rational and provable. Of course, if data collection and life were perfect, I could use these skills to solve every problem thrown at me like Charlie Eppes in the TV show Numb3rs, but the realistic goal would be to match the accuracy of insurance companies for IT security or other complex decisions.