Adding GSoC proposals for this year's students
Damian Johnson

Damian Johnson committed on 2011-05-13 17:51:39
Showing 2 changed files with 235 insertions and 0 deletions.

@@ -212,6 +212,10 @@
       <li><h4><a href="http://tor.spanning-tree.org/proposal.html">DNSEL Rewrite</a> by Harry Bock</h4></li>
       <li><h4><a href="http://kjb.homeunix.com/wp-content/uploads/2010/05/KevinBerry-GSoC2010-TorProposal.html">Extending Tor Network Metrics</a> by Kevin Berry</h4></li>
       <li><h4><a href="../about/gsocProposal/gsoc10-proposal-soat.txt">SOAT Expansion</a> by John Schanck</h4></li>
+      <li><h4><a href="http://inspirated.com/uploads/tor-gsoc-11.pdf">GTK+ Frontend and Client Mode Improvements for arm</a> by Kamran Khan</h4></li>
+      <li><h4><a href="http://www.gsathya.in/gsoc11.html">Orbot + ORLib</a> by Sathya Gunasekaran</h4></li>
+      <li><h4><a href="http://blanu.net/TorSummerOfCodeProposal.pdf">Blocking-resistant Transport Evaluation Framework</a> by Brandon Wiley</h4></li>
+      <li><h4><a href="../about/gsocProposal/gsoc11-proposal-metadataToolkit.txt">Metadata Anonymisation Toolkit</a> by Julien Voisin</h4></li>
       <li><h4><a href="http://www.atagar.com/misc/gsocBlog09/">Website Pootle Translation</a> by Damian Johnson</h4></li>
     </ul>
     
@@ -0,0 +1,231 @@
Hello,

I am Julien Voisin, an undergraduate computer science student from France.

I am interested in working on the “Meta-data anonymizing toolkit for file publication” project. I know there is already a student interested in this project, but I really want to do it: I needed such a tool myself and thought about a potential design some time ago.

I would like to work for the EFF because I am very concerned about privacy issues on the Internet. I think privacy is an essential right, not just an option. In particular, I would really enjoy working for the Tor project (or Tails, since it is heavily based on Tor). I have been using it for quite some time and would like to get more involved and contribute back!

I use F/OSS on a daily basis (Ubuntu, Debian, Arch Linux and Gentoo). So far my major contributions have been writing documentation for Arch Linux, OpenMW, the xda forums and Ubuntu. Recently I released a little matrix manipulation library written in C, originally for an academic project (http://dustri.org/lib/). I am interested in creating the Debian package for it, but I have heard that this can be quite tricky, so a little help would be much appreciated.

I do not have any major plans for this summer (though my holidays only begin on June 4th), so I can fully focus on the project and reasonably expect to commit 5-6 hours per day to it.

Requirements/Deliverables:

    * A command line and a GUI tool, both with the following capabilities (in order of importance):
          o List the metadata embedded in a given file
          o A batch mode to handle a whole directory (or set of directories)
          o The ability to scan files packed in the most common archive formats
          o A nice binding for srm (Secure ReMoval) or shred (GNU coreutils) to properly remove the original file containing the offending metadata
          o Let the user delete/modify a specific metadata field

    * Should run on the most common OSes/architectures (and especially on Debian Squeeze, since Tails is based on it)
    * The whole thing should be easily extensible (in particular, adding support for new file formats should be easy)
    * The proper functioning of the software should be easily testable

I'd like to do this project in Python, because I have already done some personal projects with it (for which I also used Subversion): an IRC bot tailored for logging (dustri.org/tor/depluie.tar.bz2, still a heavy work in progress), a battery monitor, a simple search engine indexing FTP servers, ...

Why is Python a good choice for implementing this project?

   1. I am experienced with the language
   2. There are plenty of libraries to read/write metadata; among them, Hachoir (https://bitbucket.org/haypo/hachoir/) looks very promising since it supports quite a few file formats
   3. It is easy to wrap other libraries for our needs (even if they are not written in Python!)
   4. It runs on almost every OS/architecture, which is a great benefit for portability
   5. It is easy to write unit tests (thanks to the built-in unittest module)

Proposed design:

The proposed design has three main components: a library, a command line tool and a GUI tool.

The aim of the library (described in more detail in the next part) is to make the development of tools easy. Special attention will be paid to the API it exposes. The ultimate goal is to be able to add support for a new file format to the library without changing the tools' code.

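The intended shape of that API can be sketched as follows; the class names, the method names and the extension-based dispatch are all hypothetical, just one way to keep the tools independent from the per-format code:

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of the abstraction layer described above; these names
# are illustrative, not the actual API of the final library.
class MetaParser(ABC):
    """One subclass per file format (or per wrapped backend)."""

    @abstractmethod
    def get_meta(self, path):
        """Return a dict mapping metadata field names to values."""

    @abstractmethod
    def remove_all(self, path):
        """Write a cleaned copy of the file, stripped of its metadata."""

# Formats register themselves here, so adding a new format means adding
# one subclass and one registry entry, without touching the tools' code.
_PARSERS = {}

def register(extension, parser_cls):
    _PARSERS[extension] = parser_cls

def parser_for(path):
    """Return a parser instance for the file, or None if unsupported."""
    ext = path.rsplit('.', 1)[-1].lower()
    cls = _PARSERS.get(ext)
    return cls() if cls is not None else None
```

Adding a new format would then be a matter of writing one subclass and one register() call.
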
Metadata reading/writing library:

A library to read and write metadata for various file formats. The main goal is to provide an abstraction interface (over the file format and over the underlying libraries used). At first it would only wrap Hachoir.

Why Hachoir:

    * Autofix: Hachoir is able to open invalid/truncated files
    * Lazy: opening a file is very fast since no information is read up front; data is read and/or computed when the user asks for it
    * Types: Hachoir has many predefined field types (integer, bit, string, etc.) and supports strings with charsets (ISO-8859-1, UTF-8, UTF-16, ...)
    * Addresses and sizes are stored in bits, so flags are stored as classic fields
    * Editor: using Hachoir's representation of the data, you can edit, insert and remove data, and then save to a new file
    * Metadata: supports a very large range of file formats

But we could also wrap other libraries to support a particular file format, or write the support for a format ourselves, although this should be avoided if possible (it looks simple at first, but supporting different versions of a format and maintaining the code over time is extremely time consuming). Ideally, the underlying libraries would be optional dependencies.

One typical use case of the library is to ask for the metadata of a file; if the format is supported, a list (or maybe a tree) of metadata fields is returned.

Both the GUI and the command line tool will use this library.

The command line/GUI tool features:

    * List all the metadata
    * Remove all the metadata
    * Anonymise all the metadata
    * Let the user choose which metadata fields to modify
    * Support archive anonymisation
    * Secure removal
    * Clean whole folders recursively

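A minimal sketch of how these features could map onto a command line interface, using Python's built-in argparse; the program name ("mat") and every flag name are assumptions, not a final design:

```python
import argparse

# Hypothetical command line interface covering the features listed above.
def build_parser():
    parser = argparse.ArgumentParser(
        prog="mat",
        description="List, anonymise or remove metadata from files.")
    parser.add_argument("files", nargs="+",
                        help="files or directories to process")
    parser.add_argument("-l", "--list", action="store_true",
                        help="only list the metadata, do not modify anything")
    parser.add_argument("-r", "--recursive", action="store_true",
                        help="clean whole folders recursively (batch mode)")
    parser.add_argument("-s", "--secure-removal", action="store_true",
                        help="securely remove the original file afterwards")
    parser.add_argument("--field", action="append", default=[],
                        help="restrict the action to a specific metadata field")
    return parser
```
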
GUI:
The GUI tool would essentially offer the same features as the command line tool. I do not have significant GUI development experience, but I am planning to fix that during the community bonding period.

Proposed development method:
One way to develop this would be layer by layer: first implementing the metadata reading/writing library for all the formats in one shot, then building the command line application, and so on.

However, for this project, developing feature by feature seems more appropriate: starting with a skeleton implementing a thin slice of functionality that traverses most of the layers. For example, I could start by focusing only on EXIF metadata: make sure that the metadata reading/writing library supports EXIF, then build the command line tool on top of it. Only then, when the skeleton is actually working, add support for other features/formats.

This allows a more incremental development flow, and after only a few weeks I would be able to deliver a working system. The list of features supported in the first iterations would be ridiculously short, but at least it would let me get feedback and quickly notice if I am on the wrong track. Also, since I would add one feature at a time, the structure of the system will tend to accommodate that easily. Thanks to this, adding support for a new file format will remain easy, even after the end of the GSoC.

Testing
For such a tool, since the smallest crack could compromise the user, testing is critically important. I plan to implement two main kinds of testing:

Unit tests

To test the proper working of the metadata accessing library: maintaining a collection of files with metadata for every format we should support. Using Python's unittest, I can set up expectations to make sure that the library finds the right fields.

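A sketch of what such a unit test could look like; extract_meta() is a hypothetical library call (stubbed out here), and the file path and expected fields are invented for illustration:

```python
import unittest

# Hypothetical stand-in for the library call under test; in the real suite
# this would be the metadata accessing library reading a file from the
# maintained test collection.
def extract_meta(path):
    return {"Camera model": "E-M10", "Date": "2011-04-02"}

class TestExifExtraction(unittest.TestCase):
    # Expected metadata for one file of the test collection (invented values).
    EXPECTED = {"Camera model": "E-M10", "Date": "2011-04-02"}

    def test_finds_expected_fields(self):
        meta = extract_meta("tests/data/sample.jpg")
        for field, value in self.EXPECTED.items():
            self.assertEqual(meta.get(field), value)
```

One such TestCase per format keeps the expectations next to the test files they describe.
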
End-to-end testing:

A script (or set of scripts) to test the proper working of the command line tool. One way is to run the tool in batch mode on the set of input test files. If we can still find metadata in the output, the system is not doing its job right.

It is possible to write this script in Python, possibly again using the unittest library (even if here the goal is to execute an external binary and observe its behaviour), to get test output that is consistent with the unit tests.

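The core check of such an end-to-end script could be a scan of the tool's output files for byte markers of known metadata blocks; the marker list below is a tiny, illustrative sample, far from exhaustive:

```python
# Byte markers of well-known metadata blocks (illustrative sample only).
KNOWN_MARKERS = [
    b"Exif",      # EXIF block in JPEG files
    b"/Author",   # author entry in a PDF dictionary
    b"xmpmeta",   # Adobe XMP packet
]

def contains_meta(path, markers=KNOWN_MARKERS):
    """Return True if any known metadata marker survives in the file."""
    with open(path, "rb") as handle:
        data = handle.read()
    return any(marker in data for marker in markers)
```

After running the tool in batch mode, the script would assert contains_meta() is False for every output file.
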
The goal is to be able to automatically test the whole system in a few commands (one for the unit tests, one for the end-to-end tests). It will be easy to add a new document to the test set, so if someone from the community provides a file that was not cleaned properly, we can easily reproduce the problem and then decide on the proper action to take. The ability to test everything easily might also help a user make sure that everything works fine on their OS/architecture. Also, if for some reason it is decided after a few months to switch from Hachoir to another underlying library (a pretty radical decision, just given as an example), we would still have the tests, which are very valuable for checking that everything still works with the new library.

Timeline:

    * Community bonding period (in order of importance)
          o Playing around with PyGObject
          o Playing with Hachoir
          o Learning git

    * First two weeks:
          o Create the structure in the repository (directories, README, ...)
          o Create a skeleton

          o Objective: to have a deployable working system as soon as possible (even if the feature list is ridiculous), so that I can then show you my work incrementally and get feedback early
          o The library will handle reading/writing EXIF fields (using Hachoir)
          o A set of test files (and automated unit tests) to demonstrate that the library does the job
          o The beginning of the command line tool, which at this point must be able to list and delete EXIF metadata
          o An automated end-to-end test to show that the command line tool properly removes the EXIF data

After this first step (building the skeleton), I should be able to deliver a working system right after adding each of the following features. I hope to get feedback so I can fix problems quickly.

    * 3 weeks
          o Adding support for (in order of importance) pdf, zip/tar/bzip (just the metadata, not the content yet), jpeg/png/bmp, ogg/mpeg1-2-3, exe, ...
          o For every type of metadata, that involves:
                + Creating some input test files with metadata
                + Implementing the feature in the library
                + Asserting that the library does the job with unit tests
                + Modifying the command line tool to support the feature (if necessary)
                + Checking that the command line tool can properly delete this type of metadata with an automated end-to-end test

    * About one day
          o Enable the command line tool to set a specific metadata field to a chosen value

    * About one day
          o Implementation of the “batch mode” in the command line tool, to clean a whole folder
          o Implementation of secure removal

    * About two days:
          o Add support for deep archive cleanup
                + Clean the content of the archives
                + Make a list of unsupported formats, for which we warn the user that only the container can be cleaned of metadata, not the content (at first that will include rar, 7zip, ...)
                + The supported formats will be those supported natively by Python (bzip2, gzip, tar)
                + Create some test archives for each supported format, containing various files with metadata
                + Implement the deep cleanup for each format
                + Assert that the command line tool passes the end-to-end tests (that is, it can correctly clean the content of the test archives)

    * About two days
          o Add support for complete deletion of the original files
          o Make a nice binding for shred (should not be too hard using Python)
          o Implement the feature in the command line tool

    * 3 weeks
          o Implementation of the GUI tool
          o At this stage, I can use the experience from implementing the command line tool to implement the GUI tool with the same features

    * 1 week
          o Add support for more formats (possibly based on requests from the community)

    * Remaining weeks
          o I want to keep the remaining weeks in case of problems, and for:
                + Remaining polishing/cleanup
                + Bug fixing
                + Integration work
                + Missing features
                + Packaging
                + Final documentation

    * Every weekend:
          o Documentation time, both end-user and design. I do not like documenting my code while I am writing it (it slows down development a lot), but delaying it too much is not good either: weekends seem fine for this.
          o A blog post, and a mail to the mailing list, about what I have done during the week.

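The shred binding planned in the timeline above can be sketched as a thin wrapper; the pure-Python fallback (a single random overwrite) is my own assumption and is deliberately noted as much weaker than shred's default multi-pass behaviour:

```python
import os
import shutil
import subprocess

# Sketch of the shred binding: call GNU shred when it is available, and
# fall back to a simple overwrite-then-unlink in pure Python otherwise.
# The fallback's single overwrite pass is a weak substitute; this only
# illustrates the wrapper, it is not a guarantee of secure deletion.
def secure_remove(path):
    if shutil.which("shred") is not None:
        # -u removes (unlinks) the file after overwriting it
        subprocess.check_call(["shred", "-u", path])
    else:
        size = os.path.getsize(path)
        with open(path, "r+b") as handle:
            handle.write(os.urandom(size))
            handle.flush()
            os.fsync(handle.fileno())
        os.remove(path)
```
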
About the anonymisation process:

The plan is to first rely on the capabilities of Hachoir. I do not know yet how effective it is in every case; I am planning to do some tests during the community bonding period. Following the test strategy described before, it will be easy to add a new document to the test set. If I can make a test file with metadata not supported by Hachoir (or someone from the community provides such a file), we can then decide on the proper action: propose a patch to Hachoir, or use another library for this specific format.

Doing R&D on supporting exotic fields, or improving existing support, will depend on my progress. Speaking of fingerprints: the subject of the project is “metadata anonymisation”, not “fingerprint detection”. There are too many pieces of software and too many subtle ways to alter/mark a file (and I am not even speaking of steganography).

So, no, I am not planning to implement fingerprint detection. Not that it is uninteresting; it is just not realistic to support it correctly (actually, not realistic to support it at all, given that I must first support the metadata, in such a short time frame). Would you agree that it is better to first focus on making a working tool that does the job for the known metadata?

That done, if the design is good we could easily add support for more exotic fields (or some kind of fingerprinting). I think we should never lose track of the sad truth: no matter how big an effort we spend on this, a comprehensive tool able to detect every kind of metadata and every pattern of fingerprinting is just not feasible.

About archive anonymisation

Task order for the anonymisation of an archive:

         1. Extract the archive
         2. Anonymise its content
         3. Recompress the archive
         4. Securely delete the extracted content
         5. Anonymise the newly created archive

I am planning to support extraction only for the archive formats supported by Python (without this limitation, the tool would not be portable). If the user tries to anonymise a .rar, for example, the tool will show “WARNING - only the container will be anonymised, not the content - WARNING”.

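For the tar format, the container side of steps 3 and 5 can be sketched with Python's built-in tarfile module; clean_tar is a hypothetical name, and this only resets the metadata tar itself stores per member (cleaning the files inside the archive, step 2, would happen before repacking):

```python
import tarfile

# Sketch: rewrite a tar archive while resetting the per-member metadata
# that the tar container itself stores (timestamps, owner names and ids).
def clean_tar(src_path, dst_path):
    with tarfile.open(src_path, "r") as src, \
         tarfile.open(dst_path, "w") as dst:
        for member in src.getmembers():
            member.mtime = 0
            member.uid = member.gid = 0
            member.uname = member.gname = ""
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
```

The gzip and bzip2 wrappers would need similar treatment (gzip headers can embed the original file name and a timestamp).
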
As for what I expect from my mentor: I think he should try to be available (not immediately, but within 48 hours) when I specifically need him (e.g. for technical questions no one else on IRC can answer), but he does not need to check on me all the time. I am fine with mail too (since intrigeri is not always “online”, it will, I think, be the best way to communicate with him). I would also like a weekly meeting, on IRC or Jabber, to discuss things more smoothly.

I use IRC quite a lot, and I hang out in #tor and #tor-dev (nick: jvoisin).
I am planning to write a blog post every weekend about the progress of the project.
You can mail me at julien.voisin@dustri.org for more information.