Storing copies of past gsoc proposals (9be75c2bc) - tor-webwml.git

Storing copies of past gsoc proposals

Damian Johnson committed on 2011-05-14 21:14:00
Showing 7 changed files with 451 additions and 232 deletions.

Suggested by Roger in case the originals vanish. These are pdf printoffs
with the exception of hbock's DNSEL rewrite app, which didn't translate well.
This was a simple page, so just copying it.

I'm also replacing a simple txt copy of jvoisin's metadata toolkit proposal
with a much better looking pdf of his melange app (restricted to just the
content via firebug).

about/en/gsoc.wml a93e193c0..1fb06160d
about/gsocProposal/gsoc10-proposal-dnselRewrite.html 000000000..a8444c6c8
about/gsocProposal/gsoc10-proposal-metrics.pdf 000000000..a35ba98d6
about/gsocProposal/gsoc11-proposal-armGtkFrontend.pdf 000000000..de7701407
about/gsocProposal/gsoc11-proposal-blockingResistance.pdf 000000000..784e4021a
about/gsocProposal/gsoc11-proposal-metadataToolkit.pdf 000000000..e09ba490e
about/gsocProposal/gsoc11-proposal-metadataToolkit.txt 37aa75012..000000000

about/en/gsoc.wml

View file @ 9be75c2bc

@@ -220,7 +220,7 @@
       <li><h4><a href="http://inspirated.com/uploads/tor-gsoc-11.pdf">GTK+ Frontend and Client Mode Improvements for arm</a> by Kamran Khan</h4></li>
       <li><h4><a href="http://www.gsathya.in/gsoc11.html">Orbot + ORLib</a> by Sathya Gunasekaran</h4></li>
       <li><h4><a href="http://blanu.net/TorSummerOfCodeProposal.pdf">Blocking-resistant Transport Evaluation Framework</a> by Brandon Wiley</h4></li>
-      <li><h4><a href="../about/gsocProposal/gsoc11-proposal-metadataToolkit.txt">Metadata Anonymisation Toolkit</a> by Julien Voisin</h4></li>
+      <li><h4><a href="../about/gsocProposal/gsoc11-proposal-metadataToolkit.pdf">Metadata Anonymisation Toolkit</a> by Julien Voisin</h4></li>
       <li><h4><a href="http://www.atagar.com/misc/gsocBlog09/">Website Pootle Translation</a> by Damian Johnson</h4></li>
     </ul>
     

about/gsocProposal/gsoc10-proposal-dnselRewrite.html

View file @ 9be75c2bc

...	...	@@ -0,0 +1,450 @@
	1	+<html><head>
	2	+<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
	3	+ <title>GSoC Application</title>
	4	+ <style type="text/css">
	5	+ body {
	6	+ width:60%;
	7	+ text-align:justify;
	8	+ }
	9	+ </style>
	10	+ </head><body>
	11	+<h2><a href="http://torproject.org/">Tor Project</a> - DNSEL Rewrite</h2>
	12	+<b>Google Summer of Code Student Application</b> <br>
	13	+Harry Bock <hbock AT ele DOT uri DOT edu>
	14	+<blockquote>
	15	+<h2>Abstract:</h2>
	16	+
	17	+<p>
	18	+ The TorDNSEL project is concerned with identifying individual hosts
	19	+ as valid and accessible Tor exit relays. Each Tor exit relay has an
	20	+ associated exit policy governing what traffic may leave the Tor
	21	+ circuit and go out as requests to the internet. A public database
	22	+ that can be easily queried or scraped would be of huge benefit to
	23	+ the Tor community and to services that are interested in whether
	24	+ clients originate from the Tor network, such as Wikipedia and IRC
	25	+ networks.
	26	+</p>
	27	+</blockquote>
	28	+
	29	+<ol>
	30	+<li>
	31	+ <strong>
	32	+ What project would you like to work on? Use our ideas lists as a
	33	+ starting point or make up your own idea. Your proposal should
	34	+ include high-level descriptions of what you're going to do, with
	35	+ more details about the parts you expect to be tricky. Your
	36	+ proposal should also try to break down the project into tasks of a
	37	+ fairly fine granularity, and convince us you have a plan for
	38	+ finishing it.
	39	+ </strong>
	40	+ <p>
	41	+ My primary interest is in the TorDNSEL rewrite. TorDNSEL is currently
	42	+ unmaintained and written in Haskell; I would like to rework it
	43	+ from the ground up using Python and the TorFlow interface.
	44	+ </p>
	45	+ <ul>
	46	+ <li>
	47	+ Prior to actually building the new TorDNSEL, some time must be
	48	+ spent researching and testing strategies for identifying and verifying
	49	+ recent Tor exit relays.
	50	+ <ul>
	51	+ <li>
	52	+ Much of this information is available straight from the
	53	+ cached-descriptors file distributed to running Tor relays, but
	54	+ not all of it is accurate or up-to-date.
	55	+ </li>
	56	+ <li>
	57	+ <p>
	58	+ To ensure the honesty of advertised exit nodes, the program
	59	+ must actively build circuits over the Tor network to the
	60	+ intended exit node and verify the IP address and exit policies
	61	+ listed in the cached router list. This will be accomplished via
	62	+ TorFlow, most likely using the NetworkScanners tool SoaT. If
	63	+ more detailed checking is required than provided by SoaT, it will
	64	+ be modified and extended to suit the new requirements.
	65	+ </p>
	66	+ </li>
	67	+ <li>
	68	+ <p>
	69	+ Since new relays entering the Tor network are almost
	70	+ immediately available for use, it is important that new
	71	+ relays are checked and added as quickly as possible.
	72	+ Testing ExitPolicy honesty may be time-consuming for
	73	+ certain relays, destinations, and services.
	74	+ </p>
	75	+ <p>
	76	+ To improve exit honesty checking latency, hosts that have
	77	+ complex exit policies may be checked incrementally; for
	78	+ example, if popular services such as http/https/domain/ssh
	79	+ are allowed by the relay's ExitPolicy, these should be
	80	+ checked first and the relay can be marked as "honest
	81	+ (preliminary)", so that partial results may be listed
	82	+ earlier pending more thorough circuit testing.
	83	+ </p>
	84	+ </li>
	85	+ </ul>
	86	+ </li><li>
	87	+ Once the desired method for compiling and verifying exit relays
	88	+ is tested, a formal design specification for the following operations
	89	+ must be compiled:
	90	+ <ol>
	91	+ <li><p>Scraping of all known exit relays and their exit
	92	+ policies.</p> As noted in the Tor dir-spec v3, in the future
	93	+ it will not scale to have every Tor client/directory cache know
	94	+ the IP of every other router. We need to be able to accurately
	95	+ obtain this data, up-to-date and in bulk, from an authoritative
	96	+ source. The DNSEL should be able to verify all exit relays,
	97	+ in the shortest time span allowable.
	98	+ </li>
	99	+ <li>
	100	+ Active testing for valid exit IP addresses and policies.
	101	+ </li>
	102	+ <li>
	103	+ Parameters and raw format of the resulting data.
	104	+ </li>
	105	+ <li>
	106	+ Formal mechanisms and formats for access by consumers of this
	107	+ data. As a minimal starting point, SQLite and JSON should be
	108	+ supported for bulk data sets, as both are well-structured,
	109	+ standardised formats that have cross-platform open-source
	110	+ support for the most widely used programming languages. Once this
	111	+ data is readily available, it would be extremely useful to provide
	112	+ a simple API in Python to improve ease-of-use and integration of
	113	+ this data with existing services.
	114	+ </li>
	115	+ </ol>
	116	+ </li>
	117	+ <li>
	118	+ <p>
	119	+ Currently, users can check the TorBulkExitList on
	120	+ check.torproject.org or perform DNS queries against
	121	+ exitlist.torproject.org, but this is not ideal for all consumers
	122	+ of this data; currently it is expensive to perform these queries
	123	+ and they must be done one at a time over the network. While this
	124	+ is less of an issue for services that need to perform infrequent
	125	+ exit checks, I propose that this mechanism can actually harm
	126	+ anonymity, as an adversary that can track these queries (those
	127	+ with access to the network of the querying service, or those
	128	+ running rogue DNSEL implementations, should it become more
	129	+ distributed) can determine, for example, what exit nodes are
	130	+ currently serving a large number of IRC connections. This is
	131	+ more important for services that do frequent queries on a wide
	132	+ range of IP addresses. By encouraging these queries to be done
	133	+ locally, we can improve network latency, throughput, and
	134	+ anonymity together.
	135	+ </p>
	136	+ <p>
	137	+ On the other hand, requiring all services to
	138	+ download the entire exit set when some only need to query a few
	139	+ addresses daily would be a waste of valuable resources on both
	140	+ ends. Thus, both single-exit queries and bulk exit description
	141	+ lists must be provided; raw queries would be used by services
	142	+ that are simply doing manual checks of possible Tor addresses,
	143	+ or infrequent automated checks.
	144	+ </p>
	145	+ <p>
	146	+ The primary consumers of bulk data are services that need
	147	+ to do frequent automated checks and benefit strongly from
	148	+ local caching of data. Some prominent examples would be:
	149	+ </p><ul>
	150	+ <li>The Tor project itself (see TorBulkExitList.py,
	151	+ check.torproject.org, etc.).</li>
	152	+ <li>IRC networks such as Freenode and OFTC that
	153	+ automatically scrub any potentially identifiable
	154	+ information from WHOIS queries.</li>
	155	+ <li>Social networks or collaborative communities such as
	156	+ Wikipedia that would benefit from knowing if a
	157	+ particular IP address is shared, allowing them to
	158	+ (possibly) be more lenient towards abuse from
	159	+ anonymizing services.
	160	+ </li>
	161	+ </ul>
	162	+ <p></p>
	163	+ <p>
	164	+ A particularly important goal is to ensure that Tor users
	165	+ that also happen to run Tor relays are not automatically
	166	+ blocked by services simply because they are relays or exit
	167	+ nodes. If a service is able to ascertain that an IP address
	168	+ corresponds to a Tor relay, but its exit policy would not
	169	+ allow traffic to access the service from Tor, ideally it
	170	+ should not block access to its resources. The cheaper and
	171	+ easier it is for a service operator to validate this kind of
	172	+ information, the more likely the service is to use it and
	173	+ collaborate with the Tor community. The more the service is
	174	+ used, the more likely we are to get feedback, whether
	175	+ it be about the data format, false positives or negatives,
	176	+ invalid/incorrect exit policies, etc.
	177	+ </p>
	178	+ </li>
	179	+ <li>
	180	+ The TorDNSEL project itself will likely benefit from being broken
	181	+ into several components that interact with each other and operate on
	182	+ exit list data sets.
	183	+ <ul>
	184	+ <li>Data scraper and honesty checking daemon: constantly runs in
	185	+ the background, fetching new information about Tor relays and
	186	+ their exit policies and updating the raw journal. As new relays
	187	+ come in, they are accessed and verified for honesty and
	188	+ correctness. As relay data becomes stale according to a
	189	+ configurable TTL, they are re-checked.
	190	+ </li>
	191	+ <li>
	192	+ Data compiler: Manages, updates, and merges exit lists. Can
	193	+ import and export data from bulk formats and provide
	194	+ statistics about data sets.
	195	+ </li>
	196	+ <li>
	197	+ Single-query engine: Allows applications and external services
	198	+ to query about a single IP address and exit policy at a time.
	199	+ Can interface via several convenient methods, starting out
	200	+ simple (e.g., Python API requests or HTTP requests) and
	201	+ eventually working up to RBL-style DNS queries, like the
	202	+ original DNSEL, should time permit.
	203	+ </li>
	204	+ <li>
	205	+ Data distribution: How do clients actually access bulk data?
	206	+ This doesn't necessarily have to be its own application, but
	207	+ it must be feasible for clients to easily scrape and update
	208	+ their local caches. The data may be served by a static web
	209	+ server, like lighttpd, or via other mechanisms (sftp, or dns
	210	+ zone transfer as a more convoluted example).
	211	+ </li>
	212	+ </ul>
	213	+ </li>
	214	+ <li>
	215	+ As an aside, there are several open and accepted Tor proposals
	216	+ that are relevant to the work I will complete:
	217	+ <ul>
	218	+ <li>140 <b>Provide diffs between consensuses</b> (ACCEPTED)</li>
	219	+ <li>146 <b>Add new flag to reflect long-term stability</b> (OPEN)</li>
	220	+ <li>147 <b>Eliminate the need for v2 directories in generating v3 directories</b> (ACCEPTED)</li>
	221	+ <li>159 <b>Exit Scanning</b> (OPEN)</li>
	222	+ </ul>
	223	+ </li>
	224	+
	225	+ <!--
	226	+ <li>
	227	+ From what I have researched and discussed so far, the tricky parts
	228	+ will be verifying that listed exit nodes and their exit policies
	229	+ are correct (as advertised) and presenting this data in a useful
	230	+ way to dnsel consumers. It was originally assumed that consumers
	231	+ of dnsel records would want to use a DNS RBL-style interface, and
	232	+ while this interface has been useful to (for example) IRC server
	233	+ operators, these direct queries end up revealing more about the
	234	+ usage of exit nodes than is necessary.
	235	+ </li>
	236	+ -->
	237	+</ul>
	238	+<p></p>
	239	+<b>Estimated timeline:</b>
	240	+<p>For each weekly milestone, appropriate documentation should be written to
	241	+coincide with the completed work. For example, end-user tools should have
	242	+manual pages at the very least, and preferably include a LaTeX manual. Milestones
	243	+that are primarily experimental in nature should include complete descriptions and
	244	+proposals in plain-text where appropriate. All source code will be thoroughly
	245	+commented and include documentation useful to developers.
	246	+</p>
	247	+<p>
	248	+ <b>April 26 - May 24 (Pre-SoC)</b>: Get up to speed with Tor
	249	+ directory and caching architecture, pick apart existing Haskell
	250	+ implementation of TorDNSEL, and master TorFlow.
	251	+</p>
	252	+<p>
	253	+ <b>May 31 (end of week 1)</b>: Have a working mechanism for
	254	+ compiling as much testable information about exit relays as
	255	+ possible. This data must be easily accessible for subsequent work.
	256	+ This may be taken, adapted, or abstracted from existing data directory crawling
	257	+ in TorFlow.
	258	+</p>
	259	+<p>
	260	+ <b>June 14 (week 3)</b>: Working implementation of tests using TorFlow, especially
	261	+ ExitAuthority tools. This will probably be the most time-consuming period; may take
	262	+ up to a week more than anticipated.
	263	+</p>
	264	+<p>
	265	+ <b>June 21 (week 4)</b>: Be able to produce consistent, constantly updating exit lists
	266	+ with tested and untested exit policies listed. Find Tor developer guinea pigs to
	267	+ test and hunt for glaring holes in exit relay honesty testing and verification. :)
	268	+</p>
	269	+<p>
	270	+ <b>June 28 (week 5)</b>: Begin proof-of-concept production of bulk
	271	+ data formats (raw, SQLite and JSON), all of which should be similar
	272	+ in format. Consultations should be made with consumers of such data
	273	+ (Freenode, Wikipedia, etc.) to ensure the current data presentation
	274	+ is not overreaching or missing information that would be useful to
	275	+ them.
	276	+</p>
	277	+<p>
	278	+ <b>July 12 (week 7)</b>: Integrate existing functionality and data access methods into
	279	+ a Python API that is usable for consumers and the DNSEL application itself. Style should
	280	+ be similar to TorFlow where possible.
	281	+</p><p>
	282	+ <b>July 16 (midterm evals):</b> Completed specifications of TorDNSEL
	283	+ operation, basic data formats, and delivery methods. Completed
	284	+ first proof-of-concept implementation. First major review with
	285	+ mentors and as much of the Tor developer community as time permits.
	286	+</p>
	287	+<p>
	288	+ <b>July 19 (week 8)</b>: Work on designing and testing exit list cache update mechanisms.
	289	+ Start with something similar to cached-descriptors.new journaling, and work up
	290	+ to something useful for other data formats. Integrate mechanism into API.
	291	+</p>
	292	+<p>
	293	+ <b>July 26 (week 9)</b>: Solidify main scrape/check application and
	294	+ perform as much real-world testing as time permits, adjusting for
	295	+ major setbacks, if any.
	296	+</p>
	297	+<p>
	298	+ <b>August 2 (week 10)</b>: Make adjustments based on feedback from
	299	+ (hopefully) several real-world consumers of TorDNSEL data.
	300	+ Generally polish and improve usability of core application(s).
	301	+</p>
	302	+<p>
	303	+ <b>August 9 ("pencils down"):</b> Start pumping out documentation and comprehensive
	304	+ code review.
	305	+</p>
	306	+<p>
	307	+ <b>August 16 ("okay really, pencils down"):</b> Major remaining kinks
	308	+ should be ironed out; polish specification and documentation and
	309	+ begin writing final evaluations. Plan for future maintenance of
	310	+ TorDNSEL.
	311	+</p>
	312	+</li>
	313	+
	314	+<li>
	315	+ <strong> Point us to a code sample: something good and clean to
	316	+ demonstrate that you know what you're doing, ideally from an
	317	+ existing project.
	318	+ </strong>
	319	+ <p>
	320	+ Code from almost any project I've worked on is available at
	321	+ http://git.spanning-tree.org/. Some of my better code:
	322	+ </p><ul>
	323	+ <li><a href="http://git.spanning-tree.org/index.cgi/grizzlor/">libgrizzlor</a> [C, Common Lisp], an abstraction layer for the
	324	+ SILC client library focused on bots.
	325	+ </li>
	326	+ <li><a href="http://git.spanning-tree.org/index.cgi/rigel/tree/">rigel</a> [C], a UNIX PIC16/PIC18 program loader for use with the FIRST robotics competition.</li>
	327	+ <li><a href="http://git.spanning-tree.org/index.cgi/nis/">Network Subsystem Inventory</a> [Python], a Django web application for keeping an inventory of network resources on a large university network. </li>
	328	+ <li><a href="http://git.spanning-tree.org/index.cgi/Periscope/">Periscope</a>
	329	+ [Common Lisp], a network monitoring application inspired by IP-Audit,
	330	+designed from the ground up to work with the Argus netflow application.</li>
	331	+ </ul>
	332	+ <p></p>
	333	+</li>
	334	+<li><strong> Why do you want to work with The Tor Project / EFF in particular?</strong>
	335	+<p>
	336	+The Tor Project interests me primarily from architectural and
	337	+information security perspectives; my primary focus in information
	338	+security has always been authentication and authorization - verifying
	339	+the identity of a user to explicitly or implicitly control access to
	340	+machine and network resources. The goal of all forms of public-key
	341	+and secure hash cryptography is the authentication of a third party or
	342	+data, essentially pinning their identity down.
	343	+</p>
	344	+<p>
	345	+Tor greatly interests me because it has the opposite goal; it tries to
	346	+ensure that pinning down the identity of any particular user is
	347	+(ideally) impossible or at least greatly hindered for any non-global
	348	+adversary. Protecting the rights of network users by preserving their
	349	+anonymity is an incredibly important and complicated goal, and Tor's
	350	+role in increasing anonymity of internet access in the face of many
	351	+types of adversaries is extremely valuable. To this end, I hope that
	352	+my contributions will be found useful by the Tor project, its users,
	353	+and those working to protect these end users.
	354	+</p>
	355	+</li>
	356	+<li>
	357	+ <strong>
	358	+ Tell us about your experiences in free software development
	359	+ environments. We especially want to hear examples of how you have
	360	+ collaborated with others rather than just working on a project by
	361	+ yourself.
	362	+ </strong>
	363	+ <p>While nearly all of the projects I've worked on have been free
	364	+ software, my experience working directly with the free software
	365	+ community at large is minimal. I have contributed briefly to the
	366	+ KDE project, working on their display configuration application,
	367	+ and submitted patches to other open source projects (QoSient's
	368	+ Argus netflow tools and Google's ipaddr-py, for example). I have
	369	+ collaborated with various universities in New England on
	370	+ development of the Nautilus project (http://nautilus.oshean.org/)
	371	+ and its main subproject, Periscope
	372	+ (http://nautilus.oshean.org/wiki/Periscope), while working at the
	373	+ OSHEAN non-profit consortium. </p>
	374	+ <p>
	375	+I sincerely look forward to working with the vibrant development
	376	+community of the Tor project and hope to gain more experience in
	377	+collaborating with an experienced group of developers.
	378	+</p>
	379	+</li>
	380	+
	381	+<li>
	382	+ <strong>
	383	+ Will you be working full-time on the project for the summer, or
	384	+ will you have other commitments too (a second job, classes, etc)?
	385	+ If you won't be available full-time, please explain, and list
	386	+ timing if you know them for other major deadlines
	387	+ (e.g. exams). Having other activities isn't a deal-breaker, but we
	388	+ don't want to be surprised.
	389	+ </strong>
	390	+ <p>
	391	+ I will be working part-time at the University of Rhode Island
	392	+ Information Security Office, and will have one summer class for five
	393	+ weeks starting in late May. I don't anticipate either will
	394	+ significantly affect my involvement with the Tor project.
	395	+ </p>
	396	+</li>
	397	+<li>
	398	+ <strong>
	399	+ Will your project need more work and/or maintenance after the
	400	+ summer ends? What are the chances you will stick around and help
	401	+ out with that and other related projects?
	402	+ </strong>
	403	+<p>
	404	+While I am confident I can produce a working initial implementation of
	405	+dnsel in the time allotted, I anticipate it will need more work at the
	406	+end of summer. One of my primary goals for the dnsel project is to
	407	+make it easier to maintain, as its operation will have to be adjusted
	408	+to fit with changes in the Tor architecture. Making the project more
	409	+accessible to other maintainers will allow for greater collaboration
	410	+and improvements to dnsel where development on the current
	411	+implementation has stagnated.
	412	+</p>
	413	+</li>
	414	+<li>
	415	+ <strong>
	416	+ What is your ideal approach to keeping everybody informed of your
	417	+ progress, problems, and questions over the course of the project? Said
	418	+ another way, how much of a "manager" will you need your mentor to be?
	419	+ </strong>
	420	+ <p>
	421	+ I will do my best to communicate with my mentors and the Tor developer
	422	+ community at large as frequently and directly as possible, via
	423	+ #tor-dev and the mailing lists. I also hope to inform others of more
	424	+ major milestones in the project via a blog or web page, and keep
	425	+ detailed documentation and progress updates on the Tor wiki.
	426	+ </p>
	427	+</li><li>
	428	+ <strong>What school are you attending? What year are you, and what's
	429	+your major/degree/focus? If you're part of a research group, which one?</strong>
	430	+ <p>
	431	+ I am currently attending the University of Rhode Island. This is my
	432	+ fourth year in college and second at URI; I am a Computer Engineering
	433	+major, intending to graduate next year and obtain my master's degree the
	434	+following year. My primary interests are low-level software development
	435	+ and systems programming, networking, information security, and signal
	436	+processing.
	437	+ </p>
	438	+</li>
	439	+<li>
	440	+ <strong>How can we contact you to ask you further questions? Google
	441	+doesn't share your contact details with us automatically, so you should
	442	+include that in your application. In addition, what's your IRC nickname?
	443	+ Interacting with us on IRC will help us get to know you, and help you
	444	+get to know our community.</strong>
	445	+ <p>
	446	+ You can contact me at hbock@ele.uri.edu; my nickname on IRC is <b>hbock</b>.
	447	+ </p>
	448	+</li>
	449	+</ol>
	450	+</body></html>
0	451	\ No newline at end of file
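
The "single-query engine" described in the proposal above (one IP address and exit policy per query) can be illustrated with a minimal Python sketch. Nothing here comes from the proposal or from TorDNSEL itself: the names `EXIT_LIST`, `rule_matches`, `policy_allows` and `could_exit_to` are hypothetical, the addresses are documentation-range placeholders, and the policy parser is a simplified IPv4-only reading of `accept`/`reject` rules in which the first matching rule decides and an unmatched destination is assumed to be accepted.

```python
import ipaddress

# Hypothetical bulk exit list: relay IP -> ordered exit policy rules.
# The addresses are documentation ranges (RFC 5737), not real relays.
EXIT_LIST = {
    "192.0.2.10": ["reject *:25", "accept *:80", "accept *:443", "reject *:*"],
    "198.51.100.7": ["reject *:*"],
}


def rule_matches(rule, dest_ip, dest_port):
    """Return 'accept' or 'reject' if the rule covers dest_ip:dest_port, else None."""
    action, pattern = rule.split()
    addr, _, ports = pattern.rpartition(":")
    if addr != "*":
        network = ipaddress.ip_network(addr, strict=False)
        if ipaddress.ip_address(dest_ip) not in network:
            return None
    if ports != "*":
        low, _, high = ports.partition("-")
        # A single port such as "25" is treated as the range 25-25.
        if not int(low) <= dest_port <= int(high or low):
            return None
    return action


def policy_allows(rules, dest_ip, dest_port):
    """First matching rule decides; with no match we assume the exit accepts."""
    for rule in rules:
        action = rule_matches(rule, dest_ip, dest_port)
        if action is not None:
            return action == "accept"
    return True


def could_exit_to(client_ip, service_ip, service_port, exit_list=EXIT_LIST):
    """True only if client_ip is a listed relay whose policy permits the service."""
    rules = exit_list.get(client_ip)
    return rules is not None and policy_allows(rules, service_ip, service_port)
```

A service holding a local copy of the list could call `could_exit_to(client_ip, own_ip, own_port)` and treat a `False` result as "this address is either not a relay or could not have reached us through Tor", which is the case the proposal argues should not be blocked.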

about/gsocProposal/gsoc10-proposal-metrics.pdf

View file @ 9be75c2bc

about/gsocProposal/gsoc11-proposal-armGtkFrontend.pdf

View file @ 9be75c2bc

about/gsocProposal/gsoc11-proposal-blockingResistance.pdf

View file @ 9be75c2bc

about/gsocProposal/gsoc11-proposal-metadataToolkit.pdf

View file @ 9be75c2bc

about/gsocProposal/gsoc11-proposal-metadataToolkit.txt

View file @ e347212d1

@@ -1,231 +0,0 @@
                         -Hello,
                         -I am Julien Voisin, undergraduate computer science student from France.
+                        -
                         -I am interested in working on the “Meta-data anonymizing toolkit for file publication” project.
                         -I know there is already a student interested in this project, but I really want to do it:
                         -I needed it for my own use and already thought about a potential design some time ago.
+                        -
                         -I would like to work for the EFF, because I am very concerned about privacy issues on the Internet.
                         -I think privacy is an essential right, and not just an option.
                         -In particular, I would really enjoy working for the Tor project (or Tails, since it's heavily based on it).
                         -I have been using it for quite some time and would like to get more involved and contribute back!
+                        -
                         -I use F/OSS on a daily basis (Ubuntu, Debian, Arch Linux and Gentoo).
                         -So far my major contributions have been writing documentation for Arch Linux, openmw, xda-forum and Ubuntu.
                         -Recently I released a small matrix manipulation library written in C,
                         -originally for an academic project (http://dustri.org/lib/).
                         -I am interested in making the Debian package, but heard that it can be quite tricky, so a little
                         -help would be much appreciated.
+                        -
                         -I do not have any major plans for this summer (but my holidays only begin on June 4th), so I can fully focus on the project and reasonably think that I could commit 5-6 hours per day to it.
+                        -
                         -Requirements/Deliverables:
+                        -
                         -    * A command line tool and a GUI tool, both having the following capabilities (in order of importance):
                         -          o Listing the metadata embedded in a given file
                         -          o A batch mode to handle a whole directory (or set of directories)
                         -          o The ability to scan files packed in the most common archive formats
                         -          o A nice binding for srm (Secure ReMoval) or shred (GNU utils) to properly remove the original file containing the evil metas
                         -          o Let the user delete/modify a specific meta
+                        -
                         -    * Should run on the most common OS/architectures (And especially on Debian Squeeze, since Tails is based on it.)
                         -    * The whole thing should be easily extensible (especially it should be easy to add support for new file formats)
                         -    * The proper functioning of the software should be easily testable
+                        -
                         -I'd like to do this project in Python, because I have already done some personal projects with it (for which I also used subversion): an IRC bot tailored for logging (dustri.org/tor/depluie.tar.bz2 still under heavy WIP), a battery monitor, a simple search engine indexing FTP servers, ...
+                        -
                         -Why is Python a good choice for implementing this project ?
+                        -
                         -   1. I am experienced with the language
                         -   2. There are plenty of libraries to read/write metadata; among them is Hachoir (https://bitbucket.org/haypo/hachoir/), which looks very promising since it supports quite a few file formats
                         -   3. It is easy to wrap other libraries for our needs (even if they are not written in Python!)
                         -   4. Runs on almost every OS/architecture, which is a great benefit for portability
                         -   5. It is easy to write unit tests (thanks to the built-in unittest module)
+                        -
                         -Proposed design:
+                        -
                         -The proposed design has three main components : one lib, a command line tool and a GUI tool.
+                        -
                         -The aim of the library (described in more detail in the next part) is to make the development of tools easy. Special attention will be paid to the API that it exposes, the ultimate goal being to be able to add support for a new file format to the library without changing the code of the tools.
+                        -
                         -Meta reading/writing library :
+                        -
                         -A library to read and write metas for various file formats. The main goal is to provide an abstraction interface (for the file format and for the underlying libraries used).
                         -At first it would only wrap Hachoir.
                         -Why hachoir :
+                        -
                         -    * Autofix: Hachoir is able to open invalid / truncated files
                         -    * Lazy: Opening a file is very fast since no information is read from the file; data is read and/or computed when the user asks for it
+                        -
                         -    * Types: Hachoir has many predefined field types (integer, bit, string, etc.) and supports string with charset (ISO-8859-1, UTF-8, UTF-16, ...)
                         -    * Addresses and sizes are stored in bit, so flags are stored as classic fields
                         -    * Editor: Using Hachoir representation of data, you can edit, insert, remove data and then save in a new file.
                         -    * Meta: Supports a very wide range of file formats
+                        -
                         -But we could also wrap other libraries to support a particular file format. Or we could write the support for a format ourselves, although this should be avoided if possible (it looks simple at first, but supporting different versions of the format and maintaining the thing over time is extremely time consuming).
                         -Ideally, the underlying libraries would be optional dependencies.
+                        -
                         -One typical use case of the lib is to ask for the metadata of a file; if the format is supported, a list (or maybe a tree) of metas is returned.
+                        -
                         -Both the GUI and the cmdline tool will use this lib.
+                        -
                         -The cmdline/GUI tool features:
+                        -
                         -    * List all the meta
                         -    * Removing all the meta
                         -    * Anonymising all the meta
                         -    * Let the user choose which meta to modify
                         -    * Support archive anonymisation
                         -    * Secure removal
                         -    * Cleaning whole folders recursively
+                        -
                         -GUI:
                         -Essentially the GUI tool would offer the same features as the command line tool.
                         -I do not have significant GUI development experience, but I'm planning to fix that during the community bonding period.
+                        -
                         -Proposed development methods:
                         -One way to develop this would be to do it layer by layer: first implementing the meta reading/writing lib for all the formats in one shot, then making the command line application, …
+                        -
                         -However for this project, developing feature by feature seems more appropriate :
                         -I would start with a skeleton implementing a thin slice of functionality that traverses most of the layers.
                         -For example, I could start by focusing only on EXIF metas : make sure that the meta reading/writing library supports EXIF, then build the command line tool on top of that library.
                         -Only then, when the skeleton is actually working, would I add support for other features/formats.
+                        -
                         -This allows a more incremental development flow, and after only a few weeks I would be able to deliver a working system. The list of features supported in the first iterations would be ridiculously short, but at least that would enable me to get feedback and notice quickly if I am on the wrong track.
                         -Also, since I would add one feature at a time, the structure of the system will tend to accommodate that easily. Thanks to this, adding support for a new file format will be easy, even after the end of the GSoC.
+                        -
+                        -
                         -Testing
                         -For such a tool, since the smallest crack could compromise the user, testing is critically important. I plan to implement two main kinds of tests :
+                        -
                         -Unit tests
+                        -
                         -To test that the meta accessing library works properly, I will maintain a collection of files with metas for every format we should support. Using Python's unittest I can set up expectations to make sure that the library finds the right fields.
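A minimal sketch of what such a unit test could look like. The proposal does not define the library's API, so `read_meta` is a hypothetical stand-in: here it only parses a toy `key=value` header so that the example runs on its own. A real test would load sample files from the test collection and call the Hachoir-backed library instead.

```python
import unittest


def read_meta(data: bytes) -> dict:
    """Hypothetical stand-in for the metadata library.

    Parses 'key=value' header lines until the first blank line.
    """
    meta = {}
    for line in data.split(b"\n"):
        if not line.strip():
            break  # a blank line ends the toy header
        key, sep, value = line.partition(b"=")
        if sep:
            meta[key.decode()] = value.decode()
    return meta


class ReadMetaTest(unittest.TestCase):
    # Stands in for one file of the sample collection.
    SAMPLE = b"Author=Alice\nSoftware=ExampleCam 1.0\n\npayload"

    def test_finds_expected_fields(self):
        meta = read_meta(self.SAMPLE)
        self.assertEqual(meta["Author"], "Alice")
        self.assertEqual(meta["Software"], "ExampleCam 1.0")

    def test_clean_file_has_no_meta(self):
        self.assertEqual(read_meta(b"\npayload"), {})
```

Saved as a module, the tests can be run with `python -m unittest`.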
+                        -
                         -End-to-End testing:
+                        -
                         -A script (or set of scripts) to test that the command line tool works properly. One way is to run the tool in batch mode on the set of input test files. If we are still able to find metas in the output, the system is not doing its job right.
+                        -
                         -It is possible to write this script in Python, possibly using again the unittest lib (even if here the goal is to execute an external executable and see how it's behaving) to get output of the results that is consistent with the unit tests.
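A minimal sketch of an end-to-end test written with unittest, under stated assumptions: the real command line tool does not exist yet, so `FAKE_TOOL` is a hypothetical stand-in (a tiny script that drops every line starting with `meta:`), launched as an external process through `subprocess` the same way the real executable would be.

```python
import subprocess
import sys
import tempfile
import unittest
from pathlib import Path

# Hypothetical stand-in for the command line tool: it rewrites the file
# given as first argument, dropping every line that starts with "meta:".
FAKE_TOOL = (
    "import sys\n"
    "path = sys.argv[1]\n"
    "lines = open(path).read().splitlines()\n"
    "kept = [line for line in lines if not line.startswith('meta:')]\n"
    "open(path, 'w').write('\\n'.join(kept))\n"
)


class EndToEndTest(unittest.TestCase):
    def test_no_meta_left_after_cleaning(self):
        with tempfile.TemporaryDirectory() as tmp:
            sample = Path(tmp) / "sample.txt"
            sample.write_text("meta:author=Alice\ncontent\n")
            # Run the tool as an external process, as a user would.
            result = subprocess.run(
                [sys.executable, "-c", FAKE_TOOL, str(sample)],
                capture_output=True,
            )
            self.assertEqual(result.returncode, 0)
            # If metas can still be found in the output, the tool failed.
            self.assertNotIn("meta:", sample.read_text())
```

Because both kinds of tests use unittest, their reports share the same format.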
+                        -
                         -The goal is to be able to automatically test the whole system in a few commands (one for the unit tests, one for the end-to-end tests). It will be easy to add a new document to the test set, so if someone from the community provides a file that was not cleaned properly, we can easily reproduce the problem and then decide on the proper action to take. The ability to test everything easily might also help users make sure that everything works fine on their OS/architecture. Finally, if after some months it is decided to switch from Hachoir to another underlying library (a pretty radical decision, given just as an example), we still have the tests, which are valuable for checking that everything still works with the new library.
+                        -
                         -Timeline:
+                        -
                         -    * Community Bonding Period (in order of importance)
                         -          o Playing around with pygobject
                         -          o Playing with Hachoir
                         -          o Learning git
+                        -
                         -    * First two weeks :
                         -          o create the structure in the repository (directories, README, ..)
                         -          o Create a skeleton
+                        -
                         -          o Objectives : to have a deployable working system as soon as possible (even if the list of features is ridiculously short), so that I can show you my work incrementally thereafter and get feedback early.
                         -          o The lib will handle reading/writing EXIF fields (using Hachoir)
                         -          o A set of test files (and automated unit tests) to demonstrate that the lib does the job
                         -          o The beginning of the command line tool, which at this point must list and delete EXIF meta
                         -          o An automated end-to-end test to show that the command line tool does properly remove the EXIF
+                        -
                         -After this first step (making the skeleton) I should be able to deliver a working system right after adding each of the following features. I hope to get feedback so that I can fix problems quickly.
+                        -
                         -    * 3 weeks
                         -          o adding support for (in order of importance) pdf, zip/tar/bzip (just the meta, not the content yet), jpeg/png/bmp, ogg/mpeg1-2-3, exe...
                         -          o For every type of meta, that involves :
                         -                + Creating some input test files with meta data
                         -                + Implementing the feature in the library
                         -                + Asserting that the lib does the job with unit tests
                         -                + Modifying the cmd line tool to support the feature (if necessary)
                         -                + Checking that the cmd line tool can properly delete this type of meta with automated end-to-end test
+                        -
                         -    * about one day
                         -          o Enable the command line tool to set a specific meta to a chosen value
+                        -
                         -    * about 1 day
                         -          o Implementation of the “batch mode” in the cmdline tool, to clean a whole folder
                         -          o Implementation of secure removal
+                        -
                         -    * about 2 days :
                         -          o Add support for deep archive cleanup
                         -                + Clean the content of the archives
                         -                + Make a list of unsupported formats, for which we warn the user that only the container can be cleaned of meta, not the content (at first that will include rar, 7zip, ..)
                         -                + The supported formats will be those supported natively by Python (bzip2, gzip, tar)
                         -                + Create some test archives for each supported format containing various files with metas
                         -                + Implement the deep cleanup for the format
                         -                + Assert that the command line passes the end-to-end tests (that is, it can correctly clean the content of the test archives)
+                        -
                         -    * about 2 days
                         -          o Add support for complete deletion of the original files
                         -          o Make a nice binding for shred (should not be too hard using Python)
                         -          o Implement the feature in the command line tool
+                        -
                         -    * 3 weeks
                         -          o Implementation of the GUI tool
                         -          o At this stage, I can use the experience from implementing the cmdline tool to implement the GUI tool, having the same features.
+                        -
                         -    * 1 week
                         -          o Add support for more formats (might be based on requests from the community)
+                        -
                         -    * Remaining weeks
                         -          o I want to keep those remaining weeks in case of problems, and for
                         -                + Remaining/polishing cleanup
                         -                + Bugfixing
                         -                + Integration work
                         -                + Missing features
                         -                + Packaging
                         -                + Final documentation
                         -    * Every Week-end :
                         -          o Documentation time : both end-user and design. I do not like to document my code while I'm coding it (it slows down the development process a lot), but it's not a good thing to delay it too much : week-ends seem fine for this.
                         -          o A blog-post, and a mail on the mailing list about what I have done in the week.
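The shred binding mentioned in the timeline could be as thin as the following sketch. The function names are hypothetical; the flags are those of GNU `shred` (`-n` for the number of overwrite passes, `-z` for a final pass of zeros, `-u` to remove the file afterwards). This assumes `shred` is installed, and overwriting in place is known to be unreliable on journaling or copy-on-write filesystems.

```python
import shutil
import subprocess


def shred_command(path: str, passes: int = 3) -> list:
    """Build the GNU shred invocation for one file (hypothetical helper)."""
    # "--" ends option parsing, so a path starting with "-" is handled safely.
    return ["shred", "-n", str(passes), "-z", "-u", "--", path]


def secure_delete(path: str, passes: int = 3) -> None:
    """Overwrite and unlink `path` by delegating to the external shred tool."""
    if shutil.which("shred") is None:
        raise RuntimeError("shred not found in PATH")
    # check=True raises CalledProcessError if shred reports a failure.
    subprocess.run(shred_command(path, passes), check=True)
```

Keeping the command construction in its own function makes it testable without actually destroying a file.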
+                        -
                         -About the anonymisation process :
+                        -
                         -Since I'll rely on Hachoir at first, I don't know how effective it is
                         -in every case.
                         -I am planning to do some tests during the Community Bonding Period.
+                        -
                         -The plan is to first rely on the capabilities of Hachoir. I don't know yet
                         -how effective it is in each case. I am planning to do some tests during the
                         -Community Bonding Period. Following the test strategy described before, it
                         -will be easy to add a new document to the test set. If I can make a test
                         -file with metas not supported by Hachoir (or someone from the community
                         -provides such a file), we can then decide on the proper action : propose a
                         -patch to Hachoir or use another library for this specific format.
+                        -
                         -Doing R&D on supporting exotic fields, or improving existing support, will
                         -depend on my progress. Speaking of fingerprints, the subject of the
                         -project is “metadata anonymisation”, and not “fingerprint detection” : there
                         -is too much software, and too many subtle ways to alter/mark a file (and I
                         -am not even speaking of steganography).
+                        -
                         -So, no, I'm not planning to implement fingerprint detection. Not that it
                         -is not interesting, it's just that it's not realistic to support it
                         -correctly (actually it's not realistic to support it at all, given that I
                         -must first support the metas, in such a short time frame). Would you agree
                         -that it is better to first focus on making a working tool that does the job
                         -for the known metadata ?
+                        -
                         -That done, if the design is good we could easily add support for more exotic
                         -fields (or some kind of fingerprinting). I think we should never lose sight
                         -of the sad truth : no matter how much effort we spend on this, making a
                         -comprehensive tool capable of detecting every kind of meta and every pattern of
                         -fingerprinting is just not feasible.
+                        -
                         -About archives anonymisation
+                        -
                         -Task order for anonymisation of an archive :
+                        -
                         -         1. Extract the archive
                         -         2. Anonymise its content
                         -         3. Recompress the archive
                         -         4. Securely delete the extracted content
                         -         5. Anonymise the newly created archive
+                        -
                         -I'm planning to support extraction only for archive formats supported by Python (without this limitation, the tool would not be portable). If the user tries to anonymise a .rar for example, the tool will pop up “WARNING - only the container will be anonymised, not the content - WARNING”
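The archive steps above can be sketched with Python's `tarfile` module. This is only an illustration under stated assumptions: `clean_file` is a hypothetical placeholder for the per-file cleaner, `reset_member` and `clean_tar` are hypothetical names, and step 4 is replaced by ordinary removal of a temporary directory, which is not a secure deletion.

```python
import os
import tarfile
import tempfile


def clean_file(path: str) -> None:
    """Hypothetical per-file cleaner; a real one would call the metadata lib."""


def reset_member(info: tarfile.TarInfo) -> tarfile.TarInfo:
    """Step 5: blank the container-level metadata stored for each member."""
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    info.mtime = 0
    return info


def clean_tar(src: str, dst: str) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        # Step 1: extract the archive (only do this with trusted archives).
        with tarfile.open(src) as tar:
            tar.extractall(tmp)
        # Step 2: anonymise every extracted regular file.
        for root, _dirs, files in os.walk(tmp):
            for name in files:
                clean_file(os.path.join(root, name))
        # Steps 3 and 5: recompress, resetting owner and timestamp fields.
        with tarfile.open(dst, "w:gz") as out:
            for entry in sorted(os.listdir(tmp)):
                out.add(os.path.join(tmp, entry), arcname=entry,
                        filter=reset_member)
    # Step 4 placeholder: the temporary directory is removed on exit,
    # but NOT securely; a real tool would shred the extracted files.
```

Note that the gzip header written by the `"w:gz"` mode carries its own timestamp, which this sketch does not address.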
+                        -
                         -As for what I expect from my mentor, I think he should try to be available (not immediately, but within 48 hours) when I need him specifically (e.g. technical questions no one else on IRC can answer), but he doesn't need to check on me all the time. I'm fine with email too (since intrigeri is not always “online”, I think it will be the best way to communicate with him). I'd like to have a weekly meeting, on IRC or Jabber, to discuss things more smoothly.
+                        -
                         -I use IRC quite a lot, and I hang out in #tor and #tor-dev (nick : jvoisin).
                         -I'm planning to write a blog post every week-end about the progress of the project.
                         -You can mail me at : julien.voisin@dustri.org for more information.
+                        -

...	...	@@ -220,7 +220,7 @@
220	220	<li><h4><a href="http://inspirated.com/uploads/tor-gsoc-11.pdf">GTK+ Frontend and Client Mode Improvements for arm</a> by Kamran Khan</h4></li>
221	221	<li><h4><a href="http://www.gsathya.in/gsoc11.html">Orbot + ORLib</a> by Sathya Gunasekaran</h4></li>
222	222	<li><h4><a href="http://blanu.net/TorSummerOfCodeProposal.pdf">Blocking-resistant Transport Evaluation Framework</a> by Brandon Wiley</h4></li>
223		- <li><h4><a href="../about/gsocProposal/gsoc11-proposal-metadataToolkit.txt">Metadata Anonymisation Toolkit</a> by Julien Voisin</h4></li>
	223	+ <li><h4><a href="../about/gsocProposal/gsoc11-proposal-metadataToolkit.pdf">Metadata Anonymisation Toolkit</a> by Julien Voisin</h4></li>
224	224	<li><h4><a href="http://www.atagar.com/misc/gsocBlog09/">Website Pootle Translation</a> by Damian Johnson</h4></li>
225	225	</ul>
226	226