From d9d9c77f9a0351bd8f9af9dd5620ba3d40456ff8 Mon Sep 17 00:00:00 2001 From: OXPHOS Date: Tue, 22 Mar 2016 19:30:44 -0400 Subject: [PATCH 1/3] add proposals --- 2016/proposals/deng-pan.md | 106 +++++++++++++++++++++++++++++++++++++ 1 file changed, 106 insertions(+) create mode 100644 2016/proposals/deng-pan.md diff --git a/2016/proposals/deng-pan.md b/2016/proposals/deng-pan.md new file mode 100644 index 00000000..4c2d09ac --- /dev/null +++ b/2016/proposals/deng-pan.md @@ -0,0 +1,106 @@ +#Integrating `pandas.Panel` and xarray Features + +## Abstract + +Pandas package excels at processing tabular models, especially 1-D and 2-D models with data structure `pd.Series` and `pd.DataFrame`. However, despite the specialized data structure for 3-D and N-D data, `pd.Panel`, pandas is not well-designed for high dimensional data processing. Xarray was then developed to compensate the weak part of pandas. Xarray package enabled statistic and mathematical analysis of high-dimensional data, and implemented several new features that are missing in pandas. However, some useful features in pandas were not implemented in xarray yet. Thus, my project will be focusing on porting features from pandas to xarray, and improving data structure conversion between pandas and xarray. By the end of my work, I expect pandas/xarray users can have specialized tools for different types of data processing, and migrate between pandas and xarray smoothly. + +## Technical Details + +Most of my proposal is supposed to be carried out with current implemented features in pandas and xarray. For PCA part, to improve the performance, I might switch to C++ and Eigen3 library. + +## Schedule of Deliverables + +###Milestone 1: Migrate features from pandas to xarray +###Week 1 - 5, May 23rd - June 26th + +I will have my coursework finished within the first two weeks +I will implement more multiIndex-related features and port features from pandas to xarray. + + - Week 1: Enable multi-datatype in `xr.DataArray` like `pd.panel` + - Week 2: Enable groupby for one-dimensional variables / Make levels accessible as coordinate variables + - Week 3: Based on [#702](https://github.com/pydata/xarray/pull/702) and according to [#719](https://github.com/pydata/xarray/issues/719), I will implement selection return objects with MultiIndex, and add `set_index`/`reset_index`/`swaplevel` to make it easier to create and manipulate multi-indexes. + - Week 4 - 5: Port features from pd.panel to `xr.DataArray`, such as `prod`, `cumsum`, `rank`, and etc.. Ref [#791](https://github.com/pydata/xarray/issues/791) + +###Milestone 2: Implement interface between pandas and xarray data structure +###Week 6 - 9, June 27th - July 24th + +To ensure that data can migrate between pandas and xarray data structure smoothly, I want to spend time on optimizing the conversion between data types. I will check current conversion methods, and implement more features - from multi-indexed/ hierarchical-indexed `pd.DataFrame` to `xr.DataSet`; from xr.DataSet to 2-D pd.DataFrame with PCA. + + - Week 6: Test current conversion methods with newly added features in Milestone1 and fix existing bugs. + - Week 7: Based on [#702](https://github.com/pydata/xarray/pull/702) and Milestone1, I am going to take one step further and implement conversion from multiIndexed `pd.DataFrame` to `xr.DataSet`. + - Week 8 - 9: It might be useful to compress multi-dimensional metadata into 2-dimensional or 3-dimentional dataset and view in matplotlib. I am going to implement PCA method in xarray, and enable conversion from `xr.DataSet` to `pd.DataFrame` via PCA with certain data types. + +###Milestone 3: Port xarray features to pandas +###Week 10 - 11, July 25th - Aug 7st + +`pd.DataFrame` currently supports basic `.loc` indexing. I am going to introduce `xr.loc` back into pd.DataFrame to enable dictionary syntax for indexing. +I will also check for other useful features in xarray and migrate back to pandas. + +###Milestone 4: Clean-up and Wrap-up +###Week 12 - 13, Aug. 8th - Aug. 23th + +Checked for and fix opening issues for xarray. +Clean-up codes and finish documentations. + +## Future works +I would like to continue working on pandas and xarray project, to make pandas and xarray better libraries. + +## Open Source Development Experience +I have been using several open-source packages for my lab projects, while I just started to contribute to open-source projects. I fixed several bugs for Pandas package in Python, and contributed patches for Shogun, a C++ - based machine learning toolkit. +I have gained a lot from my limited experience. First, trying to solve problems kicked me out of my comfort zone and drove me to learn new skills. Also, reading other developers' codes and talking to people interactively greatly improved my insights in coding. Meanwhile, I started to learn the rules for developing and maintaining giant projects. Thus, open-source projects development experience benefitted and will benefit me a lot. + +##Patch samples: + ####Pandas: + [#12614](https://github.com/pydata/pandas/pull/12614) Fixed bug: `pd.crosstab` `margins=True` ignoring dropna + [#12650](https://github.com/pydata/pandas/pull/12650) Fixed bug: `pd.pivot_table` `dropna=False` drops columns/index names + ####Shogun: (a C++ - based machine learning toolkit) + [#3092](https://github.com/shogun-toolbox/shogun/pull/3092) Project-wise FLAG removal + [#3096](https://github.com/shogun-toolbox/shogun/pull/3092) API: add mean computation to linear algebra library + +## Informatics skills + +###Python/Numpy/Pandas + ####COURSES: + Fundamentals of Computing Specialization (Coursera, License: 5ePAwImNEe) + Bioinformatics Algorithms (Coursera) + ####PROJECTS: + Genome-wide Mitochondrial Targeting Prediction in C. elegans + RNA-seq Analysis of MSAF-1 Mutant Gene Expression Profile in C. elegans + +###Linear Algebra/Statistics + ####COURSES: + Linear Algebra and Analytical Geometry (Undergraduate, grade 95/100) + Probability and Statistics (Undergraduate, grade 99/100), + Stochastic Mathematical Methods (Undergraduate, grade 98/100) + Quantitative Genomics and Genetics (Graduate, Honor) + ####PROJECTS: + S phase genes identification and pathway enrichment study via microarray and singular value decomposition (SVD) + Genome-wide Association Study (GWAS) on Casual Polymorphisms Regulating Gene CG9186 Expression at Pupation Stage in Drosophila melanogaster + +###Other skills: R, Matlab, Java, C++, bash + +## Academic Experience + +Sept. 2013 - present + Memorial Sloan Kettering Cancer Center + Graduate Research Assistant, + Cell Department, Laboratory of Dr. Cole Haynes + +Sept. 2012 - present + Weill Cornell Graduate School of Medical Sciences + Graduate student, + Biochemistry, Cell and Molecular Biology Allied Program + +Aug. 2008 - Jul. 2012 + Tsinghua University, China + Undergraduate Student, School of Life Sciences, Major in Biological Sciences + Rank: 6/68 + Graduate with Honor + Distinguished Dissertation + +## Why this project? + +I have used the pandas package before in my lab project - deep sequencing data analysis. I also started to contribute to pandas by fixing bugs: `pd.crosstab margins` ignoring `dropna` and `pd.pivot_table` `dropna=False` drops columns/index names. +I am interested in becoming a data scientist, in which career python and pandas are powerful and popular tools. I would like to join the development of pandas, because it helps me understand pandas better, as well as improves my coding skills and insights. +I chose to join the panel/xarray project because the project offers me the chance to review and study the general features and advantages/disadvantages of both pandas and xarray packages. + From fe6ca6a8e3990daf0f3d0a788373a495d857b1f6 Mon Sep 17 00:00:00 2001 From: OXPHOS Date: Tue, 22 Mar 2016 19:35:38 -0400 Subject: [PATCH 2/3] GSoC16 Proposal Draft --- 2016/proposals/deng-pan.md | 129 ++++++++++++++++++++----------------- 1 file changed, 70 insertions(+), 59 deletions(-) diff --git a/2016/proposals/deng-pan.md b/2016/proposals/deng-pan.md index 4c2d09ac..7d6a770f 100644 --- a/2016/proposals/deng-pan.md +++ b/2016/proposals/deng-pan.md @@ -2,105 +2,116 @@ ## Abstract -Pandas package excels at processing tabular models, especially 1-D and 2-D models with data structure `pd.Series` and `pd.DataFrame`. However, despite the specialized data structure for 3-D and N-D data, `pd.Panel`, pandas is not well-designed for high dimensional data processing. Xarray was then developed to compensate the weak part of pandas. Xarray package enabled statistic and mathematical analysis of high-dimensional data, and implemented several new features that are missing in pandas. However, some useful features in pandas were not implemented in xarray yet. Thus, my project will be focusing on porting features from pandas to xarray, and improving data structure conversion between pandas and xarray. By the end of my work, I expect pandas/xarray users can have specialized tools for different types of data processing, and migrate between pandas and xarray smoothly. +Pandas package excels at processing tabular models, especially 1-D and 2-D models with data structure `pd.Series` and `pd.DataFrame`. However, despite the specialized data structure for 3-D and N-D data, `pd.Panel`, pandas is not well-designed for high dimensional data processing. Xarray was then developed to compensate the weak part of pandas. Xarray package enabled statistic and mathematical analysis of high-dimensional data, and implemented several new features that are missing in pandas. + +However, some useful features in pandas were not implemented in xarray yet. Thus, my project will be focusing on porting features from pandas to xarray, and improving data structure conversion between pandas and xarray. By the end of my work, I expect pandas/xarray users can have specialized tools for different types of data processing, and migrate between pandas and xarray smoothly. ## Technical Details -Most of my proposal is supposed to be carried out with current implemented features in pandas and xarray. For PCA part, to improve the performance, I might switch to C++ and Eigen3 library. +Most of my proposal is supposed to be carried out with current implemented features in pandas and xarray. For PCA part, to improve the performance, I might switch to C++ and Eigen3 library. + ## Schedule of Deliverables ###Milestone 1: Migrate features from pandas to xarray -###Week 1 - 5, May 23rd - June 26th -I will have my coursework finished within the first two weeks -I will implement more multiIndex-related features and port features from pandas to xarray. +**Week 1 - 5, May 23rd - June 26th** - - Week 1: Enable multi-datatype in `xr.DataArray` like `pd.panel` - - Week 2: Enable groupby for one-dimensional variables / Make levels accessible as coordinate variables - - Week 3: Based on [#702](https://github.com/pydata/xarray/pull/702) and according to [#719](https://github.com/pydata/xarray/issues/719), I will implement selection return objects with MultiIndex, and add `set_index`/`reset_index`/`swaplevel` to make it easier to create and manipulate multi-indexes. - - Week 4 - 5: Port features from pd.panel to `xr.DataArray`, such as `prod`, `cumsum`, `rank`, and etc.. Ref [#791](https://github.com/pydata/xarray/issues/791) + - I will have my coursework finished within the first two weeks + - I will implement more multiIndex-related features and port features from pandas to xarray. -###Milestone 2: Implement interface between pandas and xarray data structure -###Week 6 - 9, June 27th - July 24th + - **Week 1:** Enable multi-datatype in `xr.DataArray` like `pd.panel` + - **Week 2:** Enable groupby for one-dimensional variables / Make levels accessible as coordinate variables + - **Week 3:** Based on [#702](https://github.com/pydata/xarray/pull/702) and according to [#719](https://github.com/pydata/xarray/issues/719), I will implement selection return objects with MultiIndex, and add `set_index`/`reset_index`/`swaplevel` to make it easier to create and manipulate multi-indexes. + - **Week 4 - 5:** Port features from pd.panel to `xr.DataArray`, such as `prod`, `cumsum`, `rank`, and etc.. Ref [#791](https://github.com/pydata/xarray/issues/791) -To ensure that data can migrate between pandas and xarray data structure smoothly, I want to spend time on optimizing the conversion between data types. I will check current conversion methods, and implement more features - from multi-indexed/ hierarchical-indexed `pd.DataFrame` to `xr.DataSet`; from xr.DataSet to 2-D pd.DataFrame with PCA. +###Milestone 2: Type conversion between pandas and xarray data structure - - Week 6: Test current conversion methods with newly added features in Milestone1 and fix existing bugs. - - Week 7: Based on [#702](https://github.com/pydata/xarray/pull/702) and Milestone1, I am going to take one step further and implement conversion from multiIndexed `pd.DataFrame` to `xr.DataSet`. - - Week 8 - 9: It might be useful to compress multi-dimensional metadata into 2-dimensional or 3-dimentional dataset and view in matplotlib. I am going to implement PCA method in xarray, and enable conversion from `xr.DataSet` to `pd.DataFrame` via PCA with certain data types. +**Week 6 - 9, June 27th - July 24th** + + - To ensure that data can migrate between pandas and xarray data structure smoothly, I want to spend time on optimizing the conversion between data types. I will check current conversion methods, and implement more features - from multi-indexed/ hierarchical-indexed `pd.DataFrame` to `xr.DataSet`; from xr.DataSet to 2-D pd.DataFrame with PCA. + + - **Week 6:** Test current conversion methods with newly added features in Milestone1 and fix existing bugs. + - **Week 7:** Based on [#702](https://github.com/pydata/xarray/pull/702) and Milestone1, I am going to take one step further and implement conversion from multiIndexed `pd.DataFrame` to `xr.DataSet`. + - **Week 8 - 9:** It might be useful to compress multi-dimensional metadata into 2-dimensional or 3-dimentional dataset and view in matplotlib. I am going to implement PCA method in xarray, and enable conversion from `xr.DataSet` to `pd.DataFrame` via PCA with certain data types. ###Milestone 3: Port xarray features to pandas -###Week 10 - 11, July 25th - Aug 7st -`pd.DataFrame` currently supports basic `.loc` indexing. I am going to introduce `xr.loc` back into pd.DataFrame to enable dictionary syntax for indexing. -I will also check for other useful features in xarray and migrate back to pandas. +**Week 10 - 11, July 25th - Aug 7th** + + - `pd.DataFrame` currently supports basic `.loc` indexing. I am going to introduce `xr.loc` back into pd.DataFrame to enable dictionary syntax for indexing. + - I will also check for other useful features in xarray and migrate back to pandas. ###Milestone 4: Clean-up and Wrap-up -###Week 12 - 13, Aug. 8th - Aug. 23th -Checked for and fix opening issues for xarray. -Clean-up codes and finish documentations. +**Week 12 - 13, Aug. 8th - Aug. 23rd** + + - Checked for and fix opening issues for xarray. + - Clean-up codes and finish documentations. ## Future works I would like to continue working on pandas and xarray project, to make pandas and xarray better libraries. ## Open Source Development Experience -I have been using several open-source packages for my lab projects, while I just started to contribute to open-source projects. I fixed several bugs for Pandas package in Python, and contributed patches for Shogun, a C++ - based machine learning toolkit. +I have been using several open-source packages for my lab projects, while I just started to contribute to open-source projects. I fixed several bugs for Pandas package in Python, and contributed patches for Shogun, a C++ - based machine learning toolkit. + I have gained a lot from my limited experience. First, trying to solve problems kicked me out of my comfort zone and drove me to learn new skills. Also, reading other developers' codes and talking to people interactively greatly improved my insights in coding. Meanwhile, I started to learn the rules for developing and maintaining giant projects. Thus, open-source projects development experience benefitted and will benefit me a lot. -##Patch samples: - ####Pandas: - [#12614](https://github.com/pydata/pandas/pull/12614) Fixed bug: `pd.crosstab` `margins=True` ignoring dropna - [#12650](https://github.com/pydata/pandas/pull/12650) Fixed bug: `pd.pivot_table` `dropna=False` drops columns/index names - ####Shogun: (a C++ - based machine learning toolkit) - [#3092](https://github.com/shogun-toolbox/shogun/pull/3092) Project-wise FLAG removal - [#3096](https://github.com/shogun-toolbox/shogun/pull/3092) API: add mean computation to linear algebra library +##Patch samples + - **Pandas:** + - [#12614](https://github.com/pydata/pandas/pull/12614) Fixed bug: `pd.crosstab` `margins=True` ignoring dropna + - [#12650](https://github.com/pydata/pandas/pull/12650) Fixed bug: `pd.pivot_table` `dropna=False` drops columns/index names + + - **Shogun:** (a C++ - based machine learning toolkit) + - [#3092](https://github.com/shogun-toolbox/shogun/pull/3092) Project-wise FLAG removal + - [#3096](https://github.com/shogun-toolbox/shogun/pull/3092) API: add mean computation to linear algebra library ## Informatics skills -###Python/Numpy/Pandas - ####COURSES: - Fundamentals of Computing Specialization (Coursera, License: 5ePAwImNEe) - Bioinformatics Algorithms (Coursera) - ####PROJECTS: - Genome-wide Mitochondrial Targeting Prediction in C. elegans - RNA-seq Analysis of MSAF-1 Mutant Gene Expression Profile in C. elegans +**Python/Numpy/Pandas** + - COURSES: + - Fundamentals of Computing Specialization (Coursera, License: 5ePAwImNEe) + - Bioinformatics Algorithms (Coursera) + - PROJECTS: + - Genome-wide Mitochondrial Targeting Prediction in C. elegans + - RNA-seq Analysis of MSAF-1 Mutant Gene Expression Profile in C. elegans -###Linear Algebra/Statistics - ####COURSES: - Linear Algebra and Analytical Geometry (Undergraduate, grade 95/100) - Probability and Statistics (Undergraduate, grade 99/100), - Stochastic Mathematical Methods (Undergraduate, grade 98/100) - Quantitative Genomics and Genetics (Graduate, Honor) - ####PROJECTS: - S phase genes identification and pathway enrichment study via microarray and singular value decomposition (SVD) - Genome-wide Association Study (GWAS) on Casual Polymorphisms Regulating Gene CG9186 Expression at Pupation Stage in Drosophila melanogaster - -###Other skills: R, Matlab, Java, C++, bash +**Linear Algebra/Statistics** + - COURSES: + - Linear Algebra and Analytical Geometry (Undergraduate, grade 95/100) + - Probability and Statistics (Undergraduate, grade 99/100) + - Stochastic Mathematical Methods (Undergraduate, grade 98/100) + - Quantitative Genomics and Genetics (Graduate, Honor) + - PROJECTS: + - S phase genes identification and pathway enrichment study via microarray and singular value decomposition (SVD) + - Genome-wide Association Study (GWAS) on Casual Polymorphisms Regulating Gene CG9186 Expression at Pupation Stage in Drosophila melanogaster + +**Other skills: R, Matlab, Java, C++, bash** ## Academic Experience Sept. 2013 - present - Memorial Sloan Kettering Cancer Center - Graduate Research Assistant, - Cell Department, Laboratory of Dr. Cole Haynes + - **Memorial Sloan Kettering Cancer Center** + - Graduate Research Assistant, + - Cell Department, Laboratory of Dr. Cole Haynes Sept. 2012 - present - Weill Cornell Graduate School of Medical Sciences - Graduate student, - Biochemistry, Cell and Molecular Biology Allied Program + - **Weill Cornell Graduate School of Medical Sciences** + - Graduate student, + - Biochemistry, Cell and Molecular Biology Allied Program Aug. 2008 - Jul. 2012 - Tsinghua University, China - Undergraduate Student, School of Life Sciences, Major in Biological Sciences - Rank: 6/68 - Graduate with Honor - Distinguished Dissertation + - **Tsinghua University, China** + - Undergraduate Student, School of Life Sciences, Major in Biological Sciences + - Graduate with Honor + - Distinguished Dissertation ## Why this project? -I have used the pandas package before in my lab project - deep sequencing data analysis. I also started to contribute to pandas by fixing bugs: `pd.crosstab margins` ignoring `dropna` and `pd.pivot_table` `dropna=False` drops columns/index names. +I have used the pandas package before in my lab project - deep sequencing data analysis. I also started to contribute to pandas by fixing bugs: `pd.crosstab margins` ignoring `dropna` and `pd.pivot_table` `dropna=False` drops columns/index names. + I am interested in becoming a data scientist, in which career python and pandas are powerful and popular tools. I would like to join the development of pandas, because it helps me understand pandas better, as well as improves my coding skills and insights. + I chose to join the panel/xarray project because the project offers me the chance to review and study the general features and advantages/disadvantages of both pandas and xarray packages. + From 73d96bd135f6b66a4ad4c2f5d0c954f7b2d6f75d Mon Sep 17 00:00:00 2001 From: OXPHOS Date: Tue, 22 Mar 2016 20:20:19 -0400 Subject: [PATCH 3/3] Add contact info --- 2016/proposals/deng-pan.md | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/2016/proposals/deng-pan.md b/2016/proposals/deng-pan.md index 7d6a770f..2d0f4364 100644 --- a/2016/proposals/deng-pan.md +++ b/2016/proposals/deng-pan.md @@ -112,6 +112,19 @@ I have used the pandas package before in my lab project - deep sequencing data a I am interested in becoming a data scientist, in which career python and pandas are powerful and popular tools. I would like to join the development of pandas, because it helps me understand pandas better, as well as improves my coding skills and insights. -I chose to join the panel/xarray project because the project offers me the chance to review and study the general features and advantages/disadvantages of both pandas and xarray packages. +I look forward to joining the panel/xarray project because the project offers me the chance to review and study the general features and advantages/disadvantages of both pandas and xarray packages. I believe I can benefit a lot from finishing the work and so do pandas/xarray. + +## Contact information +Name: Pan Deng + +E-mail: engelzora@gmail.com + +Github: OXPHOS + +IRC: OXPHOS + +Time zone: UTC-05:00 + +