Adding GSoC proposals for this year's students
Damian Johnson

Damian Johnson committed on 2011-05-13 17:51:39
Showing 2 changed files with 235 insertions and 0 deletions.

@@ -212,6 +212,10 @@
       <li><h4><a href="http://tor.spanning-tree.org/proposal.html">DNSEL Rewrite</a> by Harry Bock</h4></li>
       <li><h4><a href="http://kjb.homeunix.com/wp-content/uploads/2010/05/KevinBerry-GSoC2010-TorProposal.html">Extending Tor Network Metrics</a> by Kevin Berry</h4></li>
       <li><h4><a href="../about/gsocProposal/gsoc10-proposal-soat.txt">SOAT Expansion</a> by John Schanck</h4></li>
+      <li><h4><a href="http://inspirated.com/uploads/tor-gsoc-11.pdf">GTK+ Frontend and Client Mode Improvements for arm</a> by Kamran Khan</h4></li>
+      <li><h4><a href="http://www.gsathya.in/gsoc11.html">Orbot + ORLib</a> by Sathya Gunasekaran</h4></li>
+      <li><h4><a href="http://blanu.net/TorSummerOfCodeProposal.pdf">Blocking-resistant Transport Evaluation Framework</a> by Brandon Wiley</h4></li>
+      <li><h4><a href="../about/gsocProposal/gsoc11-proposal-metadataToolkit.txt">Metadata Anonymisation Toolkit</a> by Julien Voisin</h4></li>
       <li><h4><a href="http://www.atagar.com/misc/gsocBlog09/">Website Pootle Translation</a> by Damian Johnson</h4></li>
     </ul>
     
@@ -0,0 +1,231 @@
Hello,

I am Julien Voisin, an undergraduate computer science student from France.

I am interested in working on the “Meta-data anonymizing toolkit for file publication” project. I know there is already a student interested in this project, but I really want to do it: I needed such a tool myself and thought about a potential design some time ago.

I would like to work for the EFF because I am very concerned about privacy issues on the Internet. I think privacy is an essential right, not just an option. In particular, I would really enjoy working for the Tor project (or Tails, since it is heavily based on Tor). I have been using it for quite some time and would like to get more involved and contribute back!

I use F/OSS on a daily basis (Ubuntu, Debian, Arch Linux and Gentoo). So far my major contributions have been writing documentation for Arch Linux, OpenMW, the xda forums and Ubuntu. Recently I released a little matrix manipulation library written in C, originally for an academic project (http://dustri.org/lib/). I am interested in creating the Debian package for it, but I have heard that this can be quite tricky, so a little help would be much appreciated.

I do not have any major plans for this summer (though my holidays only begin on June 4th), so I can fully focus on the project and reasonably expect to commit 5-6 hours per day to it.

Requirements/Deliverables:

    * A command line and a GUI tool, both with the following capabilities (in order of importance):
          o List the metadata embedded in a given file
          o A batch mode to handle a whole directory (or set of directories)
          o The ability to scan files packed in the most common archive formats
          o A nice binding for srm (Secure ReMoval) or shred (GNU coreutils) to properly remove the original file containing the offending metadata
          o Let the user delete/modify a specific metadata field

    * Should run on the most common OSes/architectures (and especially on Debian Squeeze, since Tails is based on it)
    * The whole thing should be easily extensible (in particular, adding support for new file formats should be easy)
    * The proper functioning of the software should be easily testable

I'd like to do this project in Python, because I have already done some personal projects with it (for which I also used Subversion): an IRC bot tailored for logging (dustri.org/tor/depluie.tar.bz2, still a heavy work in progress), a battery monitor, a simple search engine indexing FTP servers, ...

Why is Python a good choice for implementing this project?

   1. I am experienced with the language
   2. There are plenty of libraries to read/write metadata; among them, Hachoir (https://bitbucket.org/haypo/hachoir/) looks very promising since it supports quite a few file formats
   3. It is easy to wrap other libraries for our needs (even if they are not written in Python!)
   4. It runs on almost every OS/architecture, which is a great benefit for portability
   5. It is easy to write unit tests (thanks to the built-in unittest module)

Proposed design:

The proposed design has three main components: a library, a command line tool and a GUI tool.

The aim of the library (described in more detail in the next part) is to make the development of tools easy. Special attention will be paid to the API it exposes. The ultimate goal is to be able to add support for a new file format to the library without changing the tools' code.

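The intended shape of that API can be sketched as follows; the class names, the method names and the extension-based dispatch are all hypothetical, just one way to keep the tools independent from the per-format code:

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of the abstraction layer described above; these names
# are illustrative, not the actual API of the final library.
class MetaParser(ABC):
    """One subclass per file format (or per wrapped backend)."""

    @abstractmethod
    def get_meta(self, path):
        """Return a dict mapping metadata field names to values."""

    @abstractmethod
    def remove_all(self, path):
        """Write a cleaned copy of the file, stripped of its metadata."""

# Formats register themselves here, so adding a new format means adding
# one subclass and one registry entry, without touching the tools' code.
_PARSERS = {}

def register(extension, parser_cls):
    _PARSERS[extension] = parser_cls

def parser_for(path):
    """Return a parser instance for the file, or None if unsupported."""
    ext = path.rsplit('.', 1)[-1].lower()
    cls = _PARSERS.get(ext)
    return cls() if cls is not None else None
```

Adding a new format would then be a matter of writing one subclass and one register() call.
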
Metadata reading/writing library:

A library to read and write metadata for various file formats. The main goal is to provide an abstraction interface (over the file format and over the underlying libraries used). At first it would only wrap Hachoir.

Why Hachoir:

    * Autofix: Hachoir is able to open invalid/truncated files
    * Lazy: opening a file is very fast since no information is read up front; data is read and/or computed when the user asks for it
    * Types: Hachoir has many predefined field types (integer, bit, string, etc.) and supports strings with charsets (ISO-8859-1, UTF-8, UTF-16, ...)
    * Addresses and sizes are stored in bits, so flags are stored as classic fields
    * Editor: using Hachoir's representation of the data, you can edit, insert and remove data, and then save to a new file
    * Metadata: supports a very large range of file formats

But we could also wrap other libraries to support a particular file format, or write the support for a format ourselves, although this should be avoided if possible (it looks simple at first, but supporting different versions of a format and maintaining the code over time is extremely time consuming). Ideally, the underlying libraries would be optional dependencies.

One typical use case of the library is to ask for the metadata of a file; if the format is supported, a list (or maybe a tree) of metadata fields is returned.

Both the GUI and the command line tool will use this library.

The command line/GUI tool features:

    * List all the metadata
    * Remove all the metadata
    * Anonymise all the metadata
    * Let the user choose which metadata fields to modify
    * Support archive anonymisation
    * Secure removal
    * Clean whole folders recursively

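A minimal sketch of how these features could map onto a command line interface, using Python's built-in argparse; the program name ("mat") and every flag name are assumptions, not a final design:

```python
import argparse

# Hypothetical command line interface covering the features listed above.
def build_parser():
    parser = argparse.ArgumentParser(
        prog="mat",
        description="List, anonymise or remove metadata from files.")
    parser.add_argument("files", nargs="+",
                        help="files or directories to process")
    parser.add_argument("-l", "--list", action="store_true",
                        help="only list the metadata, do not modify anything")
    parser.add_argument("-r", "--recursive", action="store_true",
                        help="clean whole folders recursively (batch mode)")
    parser.add_argument("-s", "--secure-removal", action="store_true",
                        help="securely remove the original file afterwards")
    parser.add_argument("--field", action="append", default=[],
                        help="restrict the action to a specific metadata field")
    return parser
```
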
GUI:
The GUI tool would essentially offer the same features as the command line tool. I do not have significant GUI development experience, but I am planning to fix that during the community bonding period.

Proposed development method:
One way to develop this would be layer by layer: first implementing the metadata reading/writing library for all the formats in one shot, then building the command line application, and so on.

However, for this project, developing feature by feature seems more appropriate: starting with a skeleton implementing a thin slice of functionality that traverses most of the layers. For example, I could start by focusing only on EXIF metadata: make sure that the metadata reading/writing library supports EXIF, then build the command line tool on top of it. Only then, when the skeleton is actually working, add support for other features/formats.

This allows a more incremental development flow, and after only a few weeks I would be able to deliver a working system. The list of features supported in the first iterations would be ridiculously short, but at least it would let me get feedback and quickly notice if I am on the wrong track. Also, since I would add one feature at a time, the structure of the system will tend to accommodate that easily. Thanks to this, adding support for a new file format will remain easy, even after the end of the GSoC.

Testing
For such a tool, since the smallest crack could compromise the user, testing is critically important. I plan to implement two main kinds of testing:

Unit tests

To test the proper working of the metadata accessing library: maintaining a collection of files with metadata for every format we should support. Using Python's unittest, I can set up expectations to make sure that the library finds the right fields.

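A sketch of what such a unit test could look like; extract_meta() is a hypothetical library call (stubbed out here), and the file path and expected fields are invented for illustration:

```python
import unittest

# Hypothetical stand-in for the library call under test; in the real suite
# this would be the metadata accessing library reading a file from the
# maintained test collection.
def extract_meta(path):
    return {"Camera model": "E-M10", "Date": "2011-04-02"}

class TestExifExtraction(unittest.TestCase):
    # Expected metadata for one file of the test collection (invented values).
    EXPECTED = {"Camera model": "E-M10", "Date": "2011-04-02"}

    def test_finds_expected_fields(self):
        meta = extract_meta("tests/data/sample.jpg")
        for field, value in self.EXPECTED.items():
            self.assertEqual(meta.get(field), value)
```

One such TestCase per format keeps the expectations next to the test files they describe.
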
End-to-end testing:

A script (or set of scripts) to test the proper working of the command line tool. One way is to run the tool in batch mode on the set of input test files. If we can still find metadata in the output, the system is not doing its job right.

It is possible to write this script in Python, possibly again using the unittest library (even if here the goal is to execute an external binary and observe its behaviour), to get test output that is consistent with the unit tests.

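The core check of such an end-to-end script could be a scan of the tool's output files for byte markers of known metadata blocks; the marker list below is a tiny, illustrative sample, far from exhaustive:

```python
# Byte markers of well-known metadata blocks (illustrative sample only).
KNOWN_MARKERS = [
    b"Exif",      # EXIF block in JPEG files
    b"/Author",   # author entry in a PDF dictionary
    b"xmpmeta",   # Adobe XMP packet
]

def contains_meta(path, markers=KNOWN_MARKERS):
    """Return True if any known metadata marker survives in the file."""
    with open(path, "rb") as handle:
        data = handle.read()
    return any(marker in data for marker in markers)
```

After running the tool in batch mode, the script would assert contains_meta() is False for every output file.
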
The goal is to be able to automatically test the whole system in a few commands (one for the unit tests, one for the end-to-end tests). It will be easy to add a new document to the test set, so if someone from the community provides a file that was not cleaned properly, we can easily reproduce the problem and then decide on the proper action to take. The ability to test everything easily might also help a user make sure that everything works fine on their OS/architecture. Also, if for some reason it is decided after a few months to switch from Hachoir to another underlying library (a pretty radical decision, just given as an example), we would still have the tests, which are very valuable for checking that everything still works with the new library.

Timeline:

    * Community bonding period (in order of importance)
          o Playing around with PyGObject
          o Playing with Hachoir
          o Learning git

    * First two weeks:
          o Create the structure in the repository (directories, README, ...)
          o Create a skeleton

          o Objective: to have a deployable working system as soon as possible (even if the feature list is ridiculous), so that I can then show you my work incrementally and get feedback early
          o The library will handle reading/writing EXIF fields (using Hachoir)
          o A set of test files (and automated unit tests) to demonstrate that the library does the job
          o The beginning of the command line tool, which at this point must be able to list and delete EXIF metadata
          o An automated end-to-end test to show that the command line tool properly removes the EXIF data

After this first step (building the skeleton), I should be able to deliver a working system right after adding each of the following features. I hope to get feedback so I can fix problems quickly.

    * 3 weeks
          o Adding support for (in order of importance) pdf, zip/tar/bzip (just the metadata, not the content yet), jpeg/png/bmp, ogg/mpeg1-2-3, exe, ...
          o For every type of metadata, that involves:
                + Creating some input test files with metadata
                + Implementing the feature in the library
                + Asserting that the library does the job with unit tests
                + Modifying the command line tool to support the feature (if necessary)
                + Checking that the command line tool can properly delete this type of metadata with an automated end-to-end test

    * About one day
          o Enable the command line tool to set a specific metadata field to a chosen value

    * About one day
          o Implementation of the “batch mode” in the command line tool, to clean a whole folder
          o Implementation of secure removal

    * About two days:
          o Add support for deep archive cleanup
                + Clean the content of the archives
                + Make a list of unsupported formats, for which we warn the user that only the container can be cleaned of metadata, not the content (at first that will include rar, 7zip, ...)
                + The supported formats will be those supported natively by Python (bzip2, gzip, tar)
                + Create some test archives for each supported format, containing various files with metadata
                + Implement the deep cleanup for each format
                + Assert that the command line tool passes the end-to-end tests (that is, it can correctly clean the content of the test archives)

    * About two days
          o Add support for complete deletion of the original files
          o Make a nice binding for shred (should not be too hard using Python)
          o Implement the feature in the command line tool

    * 3 weeks
          o Implementation of the GUI tool
          o At this stage, I can use the experience from implementing the command line tool to implement the GUI tool with the same features

    * 1 week
          o Add support for more formats (possibly based on requests from the community)

    * Remaining weeks
          o I want to keep the remaining weeks in case of problems, and for:
                + Remaining polishing/cleanup
                + Bug fixing
                + Integration work
                + Missing features
                + Packaging
                + Final documentation

    * Every weekend:
          o Documentation time, both end-user and design. I do not like documenting my code while I am writing it (it slows down development a lot), but delaying it too much is not good either: weekends seem fine for this.
          o A blog post, and a mail to the mailing list, about what I have done during the week.

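The shred binding planned in the timeline above can be sketched as a thin wrapper; the pure-Python fallback (a single random overwrite) is my own assumption and is deliberately noted as much weaker than shred's default multi-pass behaviour:

```python
import os
import shutil
import subprocess

# Sketch of the shred binding: call GNU shred when it is available, and
# fall back to a simple overwrite-then-unlink in pure Python otherwise.
# The fallback's single overwrite pass is a weak substitute; this only
# illustrates the wrapper, it is not a guarantee of secure deletion.
def secure_remove(path):
    if shutil.which("shred") is not None:
        # -u removes (unlinks) the file after overwriting it
        subprocess.check_call(["shred", "-u", path])
    else:
        size = os.path.getsize(path)
        with open(path, "r+b") as handle:
            handle.write(os.urandom(size))
            handle.flush()
            os.fsync(handle.fileno())
        os.remove(path)
```
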
About the anonymisation process:

The plan is to first rely on the capabilities of Hachoir. I do not know yet how effective it is in every case; I am planning to do some tests during the community bonding period. Following the test strategy described before, it will be easy to add a new document to the test set. If I can make a test file with metadata not supported by Hachoir (or someone from the community provides such a file), we can then decide on the proper action: propose a patch to Hachoir, or use another library for this specific format.

Doing R&D on supporting exotic fields, or improving existing support, will depend on my progress. Speaking of fingerprints: the subject of the project is “metadata anonymisation”, not “fingerprint detection”. There are too many pieces of software and too many subtle ways to alter/mark a file (and I am not even speaking of steganography).

So, no, I am not planning to implement fingerprint detection. Not that it is uninteresting; it is just not realistic to support it correctly (actually, not realistic to support it at all, given that I must first support the metadata, in such a short time frame). Would you agree that it is better to first focus on making a working tool that does the job for the known metadata?

That done, if the design is good we could easily add support for more exotic fields (or some kind of fingerprinting). I think we should never lose track of the sad truth: no matter how big an effort we spend on this, a comprehensive tool able to detect every kind of metadata and every pattern of fingerprinting is just not feasible.

About archive anonymisation

Task order for the anonymisation of an archive:

         1. Extract the archive
         2. Anonymise its content
         3. Recompress the archive
         4. Securely delete the extracted content
         5. Anonymise the newly created archive

I am planning to support extraction only for the archive formats supported by Python (without this limitation, the tool would not be portable). If the user tries to anonymise a .rar, for example, the tool will show “WARNING - only the container will be anonymised, not the content - WARNING”.

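For the tar format, the container side of steps 3 and 5 can be sketched with Python's built-in tarfile module; clean_tar is a hypothetical name, and this only resets the metadata tar itself stores per member (cleaning the files inside the archive, step 2, would happen before repacking):

```python
import tarfile

# Sketch: rewrite a tar archive while resetting the per-member metadata
# that the tar container itself stores (timestamps, owner names and ids).
def clean_tar(src_path, dst_path):
    with tarfile.open(src_path, "r") as src, \
         tarfile.open(dst_path, "w") as dst:
        for member in src.getmembers():
            member.mtime = 0
            member.uid = member.gid = 0
            member.uname = member.gname = ""
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
```

The gzip and bzip2 wrappers would need similar treatment (gzip headers can embed the original file name and a timestamp).
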
As for what I expect from my mentor: I think he should try to be available (not immediately, but within 48 hours) when I specifically need him (e.g. for technical questions no one else on IRC can answer), but he does not need to check on me all the time. I am fine with mail too (since intrigeri is not always “online”, it will, I think, be the best way to communicate with him). I would also like a weekly meeting, on IRC or Jabber, to discuss things more smoothly.

I use IRC quite a lot, and I hang out in #tor and #tor-dev (nick: jvoisin).
I am planning to write a blog post every weekend about the progress of the project.
You can mail me at julien.voisin@dustri.org for more information.